Giaosam's Blog



Notes: Deep Learning Specialization - Structuring Machine Learning Projects

Posted on 2019-04-22

Introduction to ML Strategy

Orthogonalization

Orthogonalization means designing the tuning process so that each knob adjusts exactly one aspect of the system's performance. When building deep learning systems, pursue the following goals one at a time, in order:

  1. Fit training set well on cost function ($\approx$ human-level performance)
    If not $\rightarrow$ train a bigger network / switch to Adam optimization algorithm
  2. Fit dev set well on cost function
    If not $\rightarrow$ use regularization / bigger training set
  3. Fit test set well on cost function
    If not $\rightarrow$ get a bigger dev set
  4. Perform well in real world
    If not $\rightarrow$ change dev set or cost function

Setting up the Goal

Single Number Evaluation Metric

Having a single real-number evaluation metric that tells whether a new idea performs better or worse than the last one can lead to much faster progress.

| Classifier | Precision | Recall | $F_1 \ score$ |
| --- | --- | --- | --- |
| A | 95% | 90% | 92.4% |
| B | 98% | 85% | 91.0% |

Take cat recognition as an example. A precision of 95% for classifier A means that when classifier A predicts an image is a cat, there is a 95% chance it really is a cat. Recall is the percentage of all real cat images that the classifier correctly recognizes as cats.

Rather than using two numbers, precision and recall, to pick a classifier, it is recommended to combine them into a single metric: the harmonic mean of precision $P$ and recall $R$, called the $F_1$ score.
$$F_1 \ score = \frac 2{\frac 1{P} + \frac 1{R}}$$

Having a well-defined dev set used to measure precision and recall, plus a single number evaluation metric can speed up iterating.
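As a quick check of the table above, the $F_1$ scores can be recomputed directly from the harmonic-mean formula (a minimal Python sketch):

```python
# F_1 score: harmonic mean of precision P and recall R
def f1_score(p, r):
    return 2 / (1 / p + 1 / r)

# Classifiers A and B from the table above
f1_a = f1_score(0.95, 0.90)   # -> about 0.924, i.e. 92.4%
f1_b = f1_score(0.98, 0.85)   # -> about 0.910, i.e. 91.0%

# A wins on the single-number metric despite B's higher precision
best = "A" if f1_a > f1_b else "B"
```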

Another example: compute the average of the per-region errors as the single metric

| Algorithm | U.S. | China | India | Other | $Average$ |
| --- | --- | --- | --- | --- | --- |
| A | 3% | 7% | 5% | 9% | 6% |
| B | 5% | 6% | 5% | 10% | 6.5% |
| C | 2% | 3% | 4% | 5% | 3.5% |
| D | 5% | 8% | 7% | 2% | 5.5% |
| E | 4% | 5% | 2% | 4% | 3.75% |
| F | 7% | 11% | 8% | 12% | 9.5% |

Satisficing and Optimizing Metric

Another cat classification example: care about the running time in addition to accuracy

| Classifier | Accuracy | Running time |
| --- | --- | --- |
| A | 90% | 80 ms |
| B | 92% | 95 ms |
| C | 95% | 1,500 ms |

We can combine accuracy and running time into an overall evaluation metric:
$cost = accuracy - 0.5 * running\_ time$

Goal: choose a classifier that maximizes accuracy, subject to the running time being at most $100$ ms.

In this case, accuracy is an optimizing metric because we want to maximize it. Running time is a satisficing metric: it just has to be good enough, i.e. less than $100$ ms, and beyond that we don't really care.

$N$ metrics:

  • $1$ optimizing metric
  • $(N-1)$ satisficing metrics
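The selection rule above can be sketched in a few lines of Python, reusing the accuracy and running-time values from the table (the dictionary layout is illustrative):

```python
# Classifiers from the table: (accuracy, running time in ms)
classifiers = {"A": (0.90, 80), "B": (0.92, 95), "C": (0.95, 1500)}

# Satisficing metric: running time <= 100 ms; optimizing metric: accuracy
feasible = {k: v for k, v in classifiers.items() if v[1] <= 100}
best = max(feasible, key=lambda k: feasible[k][0])   # -> "B"
```

Classifier C has the best accuracy but fails the satisficing constraint, so B is chosen.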

Train / Dev / Test Distributions

A bad example of setting up dev and test set:
$\begin{aligned}
& \text{Regions:} \cr
& Dev = \begin{cases} US \cr UK \cr Other \ Europe \cr South \ America \end{cases} \quad Test = \begin{cases} India \cr China \cr Other \ Asia \cr Australia \end{cases}
\end{aligned}$

The problem with how the dev and test sets are set up in the example above is that we might spend months innovating to do well on the dev set, only to realize that the test-set data is very different from the dev-set data; all the months of work spent optimizing for the dev set then fail to deliver good performance on the test set.

Make the dev and test sets come from the same distribution!

When to Change Dev / Test Sets and Metrics

Comparing to Human-level Performance

Machine Learning Flight Simulator


Dense Layer

Posted on 2019-04-20

Dense

Just the regular fully-connected (densely connected) NN layer:
A linear operation in which every input is connected to every output by a weight, generally followed by a non-linear activation function.

Dense implements the operation: $output = activation(dot(input, \ W) + b)$ where $activation$ is the element-wise activation function passed as the activation argument, $W$ is a weights matrix created by the layer, and $b$ is a bias vector created by the layer.

Example in TensorFlow:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# as first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify
# the size of the input anymore:
model.add(Dense(32))

References
tf.keras.layers.Dense

Mention in passing:
At first I regarded the Dense layer as something as interesting as the Dropout or Convolutional layers, which whetted my appetite to keep a note of its concept and theory. After finding that it is just the fully-connected layer, I still wrote this post, because I was tired of writing very long posts. An easy, short post may also be a good choice.

Notes: Deep Learning Specialization - Improving Deep Neural Networks (Week3)

Posted on 2019-04-15

Hyperparameter Tuning

Tuning Process

Hyperparameters, ordered from most important to less important:
$\begin{cases}
\color{red}{\alpha} \cr
\color{orange}{\beta} \ \text{(momentum; default} \ 0.9) \cr
\color{orange}{\# hidden \ units} \cr
\color{orange}{mini-batch \ size} \cr
\beta_1, \beta_2, \epsilon \cr
\# layers \cr
\text{learning rate decay} \cr
\end{cases}$

Try random values: Don’t use a grid
In earlier generations of machine learning algorithms, with two hyperparameters $Hyperparameter \ 1$ and $Hyperparameter \ 2$ to tune, it was common practice to sample points in a grid like the one below and then systematically explore these values. This practice works okay when the number of hyperparameters is relatively small.
image

In deep learning, it is recommended to choose the points at random, because it’s difficult to know in advance which hyperparameters are going to be the most important for the problem.

To illustrate, let’s say $Hyperparameter \ 1$ turns out to be the learning rate $\alpha$. And to take an extreme example, let’s say that $Hyperparameter \ 2$ was $\epsilon$ in the denominator of the Adam algorithm. The choice of $\alpha$ matters a lot while the choice of $\epsilon$ hardly matters. Therefore, if we sample in the grid then we’ve really tried out five values of $\alpha$ and you might find that all of the different values of $\epsilon$ give us essentially the same answer.

In contrast, sampling at random leads to 25 distinct values of the learning rate $\alpha$ and therefore it will be more likely to find a value that works really well.
image

Use a coarse to fine sampling scheme
After a coarse sample of the entire square reveals that a few points in a certain region tend to work well, sample more densely within that smaller region.
image

Using an Appropriate Scale to Pick Hyperparameters

When searching for the hyperparameter $\alpha$ in the range from $0.0001$ to $1$, sampling values uniformly at random isn’t reasonable, because we are using 90% of the resources to search between $0.1$ and $1$, and only 10% of the resources to search between $0.0001$ and $0.1$. Instead, it is more reasonable to search for hyperparameters on a log scale.
$\begin{aligned}
& \alpha : 0.0001 \cdots 0.001 \cdots 0.01 \cdots 0.1 \cdots 1 \cr
& \cr
& r = -4 * np.random.rand() \ \leftarrow \ r \in [-4, 0] \cr
& \alpha = 10^r \ \leftarrow \ \alpha \in [10^{-4}, 10^0] \cr
\end{aligned}$
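The log-scale sampling above translates directly to a couple of lines of numpy (sample count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is arbitrary

# Sample the learning rate on a log scale between 10^-4 and 10^0:
r = -4 * rng.random(1000)    # r uniform in (-4, 0]
alphas = 10.0 ** r           # alpha in (10^-4, 10^0]
```

Roughly three quarters of the samples land below $0.1$, as the log-uniform scheme intends, instead of the 10% a linear-uniform sample would give.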

Hyperparameters for exponentially weighted averages
$\begin{aligned}
& \beta : 0.9 \cdots 0.999 \cr
& 1 - \beta : 0.1 \cdots 0.001 \cr
& \cr
& r \in [-3, -1], \ 1 - \beta = 10^r \cr
& \beta = 1 - 10^r \cr
\end{aligned}$

Why is it such a bad idea to sample in a linear scale?

  • $\beta : 0.9000 \ \rightarrow \ 0.9005$:
    causes hardly any change in the results, averaging over roughly 10 values of data.
  • $\beta : 0.9990 \ \rightarrow \ 0.9995$:
    has a huge impact on the results, from averaging over 1000 values of data to averaging over 2000 values of data, because of $\frac 1{1 - \beta}$.
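Sampling $\beta$ via a log scale on $1 - \beta$, as derived above, can be sketched the same way (seed and sample count arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is arbitrary

# Sample beta in [0.9, 0.999] by sampling 1 - beta on a log scale:
r = rng.uniform(-3, -1, size=1000)   # r in [-3, -1)
betas = 1 - 10.0 ** r
```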

Hyperparameters Tuning in Practice: Pandas vs. Caviar

  • Babysitting one model:
    Used when there is a huge dataset but not a lot of computational resources (CPUs and GPUs), so only one model, or a very small number of models, can be trained at a time. People who babysit a model watch its performance and patiently nudge the learning rate up or down.
  • Training many models in parallel:
    Try a lot of different hyperparameter settings, then at the end quickly pick the one that works best.

Batch Normalization

Batch normalization makes the hyperparameter search problem much easier, makes the neural network much more robust. This technique provides a much bigger range of hyperparameters that work well, and will also enable people to much more easily train even very deep networks.

Normalizing Activations in a Network

For any hidden layer, normalize the values of $z^{[l]}$ in the $l$th hidden layer, so as to train $W^{[l + 1]}, b^{[l + 1]}$ faster.

Implementing Batch Norm
Given some intermediate values in the hidden layer $l$ of a neural network: $z^{[l] (1)} \cdots z^{[l] (m)}$
$\begin{aligned}
& \mu = \frac 1m \sum_{i = 1}^m z^{[l] (i)} \cr
& \sigma ^2 = \frac 1m \sum_{i=1}^m \left( z^{[l] (i)} - \mu \right) ^2 \cr
& z_{norm}^{[l] (i)} = \frac {z^{[l] (i)} - \mu} {\sqrt {\sigma ^2 + \epsilon}} \quad (\text{add} \ \epsilon \ \text{for numerical stability}) \cr
\end{aligned}$

Now, every component of $z$ has mean $0$ and variance $1$. But we don’t want the hidden units to always have mean $0$ and variance $1$. Maybe it makes sense for hidden units to have a different distribution:
$$\color {red} {\tilde{z}^{[l] (i)} = \gamma \cdot z_{norm}^{[l] (i)} + \beta}$$

$\gamma$ and $\beta$ are learnable parameters that should be updated just like the weights of the neural network. Their effect is to let us set the mean (and variance) of $\tilde{z}^{[l] (i)}$ to whatever we want it to be.
$\begin{aligned}
& \text{If} \cr
& \quad \gamma = \sqrt {\sigma ^2 + \epsilon}, \ \beta = \mu \cr
& \text{then} \cr
& \quad \tilde{z}^{[l] (i)} = z^{[l] (i)}
\end{aligned}$
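The normalization plus learnable scale-and-shift can be sketched in numpy; `batch_norm_forward` is an illustrative name, and the layout (units × examples) follows the course's convention:

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    # z has shape (units in layer l, examples in mini-batch)
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)      # mean 0, variance 1 per unit
    return gamma * z_norm + beta, mu, var       # learnable scale and shift

rng = np.random.default_rng(0)
z = rng.normal(3.0, 2.0, size=(4, 256))

# gamma = 1, beta = 0: every unit gets mean 0 and variance 1
z_tilde, mu, var = batch_norm_forward(z, gamma=1.0, beta=0.0)

# gamma = sqrt(var + eps), beta = mu: recovers z exactly (the identity case)
z_identity, _, _ = batch_norm_forward(z, gamma=np.sqrt(var + 1e-8), beta=mu)
```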

Fitting Batch Norm into a Neural Network

$$a^{[l-1]} \xrightarrow {W^{[l]}, \ b^{[l]}} z^{[l]} \xrightarrow [\text{Batch Norm}] {\beta^{[l]}, \ \gamma^{[l]}} \tilde{z}^{[l]} \longrightarrow a^{[l]} = g^{[l]} (\tilde{z}^{[l]})$$

In practice, Batch Norm is usually applied with mini-batches of the training set.

Eliminate $b^{[l]}$ when using Batch Norm:
$\begin{aligned}
& z^{[l]} = W^{[l]} a^{[l-1]} \cr
& \text{Compute} \ z_{norm}^{[l]} \cr
& \tilde{z}^{[l]} = \gamma^{[l]} z_{norm}^{[l]} + \beta^{[l]} \cr
\end{aligned}$

Because Batch Norm computes the mean of the $z^{[l]}$ values and subtracts it, adding any constant to all of the examples in the mini-batch changes nothing; the constant is cancelled out. So $b^{[l]}$ can be dropped, and $\beta^{[l]}$ takes over its role as the shift or bias term.

Implementing gradient descent with Batch Norm (BN)
$\begin{aligned}
& \text{for} \ t=1 \cdots n \ \text{mini-batches} \cr
& \qquad \text{Compute forward propagation on} \ X^{\lbrace t \rbrace} \cr
& \qquad \qquad \text{In each hidden layer, use BN to replace} \ z^{[l]} \ \text{with} \ \tilde{z}^{[l]} \cr
& \qquad \text{Use backprop to compute} \ dW^{[l]}, \ d\beta^{[l]}, \ d\gamma^{[l]} \cr
& \qquad \text{Update parameters:} \cr
& \qquad \qquad W^{[l]} = W^{[l]} - \alpha \ dW^{[l]} \cr
& \qquad \qquad \beta^{[l]} = \beta^{[l]} - \alpha \ d\beta^{[l]} \cr
& \qquad \qquad \gamma^{[l]} = \gamma^{[l]} - \alpha \ d\gamma^{[l]} \cr
\end{aligned}$

Why Does Batch Norm Work?

It makes the weights in later or deeper layers of the network more robust to changes in the weights of earlier layers.

Take a shallow network on the cat detection task as an example. Suppose it is first trained only on images of black cats. If we then apply this network to data with colored cats, where the positive examples are no longer just black cats, the classifier might not work very well. Even though a function that works well on both might exist, the learning algorithm is unlikely to discover that decision boundary from the black-cat data alone.
image

This idea of the data distribution changing is called “covariate shift”:
When having learned some $X$ to $Y$ mapping, if the distribution of $X$ changes, then it might be necessary to retrain the learning algorithm, even if the ground-truth function mapping from $X$ to $Y$ remains unchanged.

Why is this a problem for neural networks?
image

When covering the left part of the network, the network can do a good job to map from the values $a_1^{[2]} \cdots a_4^{[2]}$ to the output values $\hat{y}$. However, after uncovering the left of the network again, the network is also adapting parameters $W^{[2]}, \ b^{[2]}$ and $W^{[1]}, \ b^{[1]}$, and so as these parameters change, these values $a_1^{[2]} \cdots a_4^{[2]}$ will also change. Hence, from the perspective of the third hidden layer, these hidden unit values are changing all the time, and therefore it is suffering from the problem of covariate shift.

What batch norm does is that, it reduces the amount that the distribution of the hidden unit values shifts around.

Plot the distribution of these hidden unit values. Plot only two values $z_1^{[2]}$ and $z_2^{[2]}$ instead of four values, in order to visualize in 2D. The values for $z_1^{[2]}$ and $z_2^{[2]}$ can change, and indeed they will change when the neural network updates the parameters in the earlier layers. But what batch norm ensures is that no matter how it changes, the mean and variance of $z_1^{[2]}$ and $z_2^{[2]}$ will remain the same.
image

Batch Norm weakens the coupling between the early layers’ parameters and the later layers’ parameters, and it allows each layer of the network to learn more independently of other layers, which has the effect of speeding up of learning in the whole network.

Batch Norm as regularization

  • Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
  • This adds some noise to the values $z^{[l]}$ within that mini-batch, so similar to dropout, it adds some noise to each hidden layer’s activations.
  • This has a slight regularization effect.

Batch Norm at Test Time

$\mu, \ \sigma^2$: estimate them using exponentially weighted averages (across the mini-batches).
Pick layer $l$ and go through mini-batches $X^{\lbrace 1 \rbrace}, \ X^{\lbrace 2 \rbrace} \cdots$ together with the corresponding values of $Y$:
$\begin{matrix}
X^{\lbrace 1 \rbrace}, & X^{\lbrace 2 \rbrace}, & X^{\lbrace 3 \rbrace}, & \cdots & \cr
\downarrow & \downarrow & \downarrow & \cdots & \cr
\mu^{\lbrace 1 \rbrace [l]}, & \mu^{\lbrace 2 \rbrace [l]}, & \mu^{\lbrace 3 \rbrace [l]}, & \cdots & \leadsto \ \text{exponentially weighted average} \ \mu \cr
(\sigma ^2) ^{\lbrace 1 \rbrace [l]}, & (\sigma ^2) ^{\lbrace 2 \rbrace [l]}, & (\sigma ^2) ^{\lbrace 3 \rbrace [l]}, & \cdots & \leadsto \ \text{exponentially weighted average} \ \sigma^2 \cr
\end{matrix}$

Multi-class Classification

Softmax Regression

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes.

Recognizing cats, dogs, and baby chicks: $C = \# classes = 4$
image

In this case, we’re going to build a new neural network, where the output layer has four units, so the output labels $\hat{y}$ is going to be a $(4, 1)$ dimensional vector, representing four probabilities.
image

Softmax activation function
$$\begin{aligned}
& a_i^{[L]} = g^{[L]} (z_i^{[L]}) = \frac { e ^{ z_i^{[L]}} } { \sum_{j=1} ^{ n^{[L]} } e ^{ z_j^{[L]}} } \cr
& a^{[L]} = \frac {e ^{ z^{[L]}}} {\sum e ^{ z^{[L]}}} \cr
\end{aligned}$$

To illustrate,
$\begin{aligned}
& z^{[L]} = \begin{bmatrix} 5 \cr 2 \cr -1 \cr 3 \end{bmatrix} \cr
& \cr
& t = \begin{bmatrix} e^5 \cr e^2 \cr e^{-1} \cr e^3 \end{bmatrix} = \begin{bmatrix} 148.41 \cr 7.39 \cr 0.37 \cr 20.09 \end{bmatrix}, \quad \sum_{j=1}^4 t_j = 176.26 \cr
& \cr
& a^{[L]} = \frac t{176.26} = \begin{bmatrix} \frac {e^5} {176.26} \cr \frac {e^2} {176.26} \cr \frac {e^{-1}} {176.26} \cr \frac {e^3} {176.26} \end{bmatrix} = \begin{bmatrix} 0.842 \cr 0.042 \cr 0.002 \cr 0.114 \end{bmatrix}, \quad \sum_{j=1}^4 a_j^{[L]} = 1 \cr
\end{aligned}$
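The worked example above can be reproduced with a few lines of numpy (subtracting the max before exponentiating is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())   # shift by max(z) for numerical stability only
    return t / t.sum()        # normalize so the outputs sum to 1

z = np.array([5.0, 2.0, -1.0, 3.0])
a = softmax(z)                # ≈ [0.842, 0.042, 0.002, 0.114]
```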

Softmax examples:
Here is a neural network without hidden layers. Feed inputs $x_1$, $x_2$ directly to a Softmax layer that has three, four, or more output nodes.
image

Training a Softmax Classifier

Softmax regression generalizes logistic regression to $C$ classes.
If $C = 2$, softmax reduces to logistic regression.

Loss function
$\begin{aligned}
\mathcal {L} (\hat{y}, \ y) = -\sum_{j = 1}^C y_j \log \hat{y}_j
\end{aligned}$

Assume the output target $y = \begin{bmatrix} 0 \cr 1 \cr 0 \cr 0 \end{bmatrix}$, and the prediction $\hat{y} = \begin{bmatrix} 0.3 \cr 0.2 \cr 0.1 \cr 0.4 \end{bmatrix}$.

$\begin{aligned}
\therefore \ \mathcal {L} (\hat{y}, \ y) &= -\sum_{j = 1}^C y_j \log \hat{y}_j \cr
& = -y_2 \log \hat{y}_2 = -\log \hat{y}_2 \cr
\end{aligned}$

In order to make $\mathcal {L} (\hat{y}, \ y)$ small, the only way is to make $-\log \hat{y}_2$ small, in other words, to make $\hat{y}_2$ as big as possible.

This loss function tries to make the predicted probability of the ground-truth class as high as possible, whichever class that happens to be in the training set.

Cost function
$\begin{aligned}
J(W^{[1]}, b^{[1]}, \cdots W^{[L]}, b^{[L]}) = \frac 1m \sum_{i=1}^m \mathcal {L} (\hat{y}^{(i)}, y^{(i)})
\end{aligned}$
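A minimal sketch of the loss on the example above (the function name is illustrative):

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    # y is a one-hot target; only the true class's log-probability survives
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0, 0.0])
y_hat = np.array([0.3, 0.2, 0.1, 0.4])
loss = cross_entropy_loss(y_hat, y)   # = -log(0.2)
```

The cost $J$ is then just this loss averaged over the $m$ training examples.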

Introduction to Programming Frameworks

Programming Assignment: Tensorflow Tutorial

Writing and running programs in TensorFlow has the following steps:

  1. Create placeholders
    x = tf.placeholder(tf.float32, shape=(1024, 1024), name='x')
  2. Specify the computation graph corresponding to operations you want to compute
  3. Create the session
  4. Run the session, using a feed dictionary if necessary to specify placeholder variables’ values.
    sess.run(..., feed_dict = {x: data})

Using one hot encodings
Many times in deep learning, we will have a $y$ vector with numbers ranging from $0$ to $C-1$, where $C$ is the number of classes. If $C$ is, for example, $4$, then we might have the following $y$ vector, which we will need to convert as follows:
image

This is called a “one hot” encoding, because in the converted representation exactly one element of each column is “hot” (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In tensorflow, you can use one line of code:
tf.one_hot(labels, depth, axis)
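In plain numpy, an equivalent conversion can be sketched with `np.eye` (here following the course's columns-as-examples layout; `tf.one_hot` itself places the classes along the axis chosen by its `axis` argument):

```python
import numpy as np

def one_hot(labels, depth):
    # each column is one example; exactly one element per column is 1
    return np.eye(depth)[labels].T

labels = np.array([1, 2, 3, 0, 2, 1])
Y = one_hot(labels, depth=4)   # shape (4, 6)
```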

Backward propagation & parameter updates
All of backpropagation and the parameter update is taken care of in one line of code:
Create an “optimizer” object and run this object along with the cost inside the tf.Session. When called, it will perform an optimization on the given cost with the chosen method and learning rate.

  • For gradient descent the optimizer would be:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
  • To make the optimization:
    _ , c = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})

When coding, we often use _ as a “throwaway” variable to store values that we won’t need to use later. Here, _ takes on the evaluated value of optimizer, which we don’t need (and c takes the value of the cost variable).

Notes: Deep Learning Specialization - Improving Deep Neural Networks (Week2)

Posted on 2019-04-07

Optimization Algorithm

Mini-Batch Gradient Descent

Mini-batch gradient descent runs much faster than batch gradient descent.

  • Shuffle
    Create a shuffled version of the training set $(X, Y)$ as shown below. Each column of $X$ and $Y$ represents a training example. Note that the random shuffling is done synchronously between $X$ and $Y$, so that after shuffling the $i^{th}$ column of $X$ is the example corresponding to the $i^{th}$ label in $Y$. The shuffling step ensures that examples are split randomly across mini-batches.
    image

  • Partition
    Partition the shuffled $(X, Y)$ into mini-batches of size $mini\_ batch\_ size$. Note that the number of training examples is not always divisible by $mini\_ batch\_ size$. The last mini batch might be smaller. When the final mini-batch is smaller than the full $mini\_ batch\_ size$, it will look like this:
    image

If the number of training examples $m = 5,000,000$ and the mini-batch size is $1000$, then there are $5000$ mini-batches.
1 step of mini-batch gradient descent using the $t^{th}$ mini-batch $X^{\lbrace t \rbrace}, Y^{\lbrace t \rbrace}$
$\begin{align}
& “1 \ epoch”\ – \ \text{pass through the training set:} \cr
& \text{for} \ t = 1 \ \text{to} \ 5000 \ \lbrace \cr
& \qquad \text{Forward propagation on} \ X^{\lbrace t \rbrace} \cr
& \qquad \text{Compute cost} \ J^{\lbrace t \rbrace} = \frac 1{1000} \sum_{i=1}^{1000} \mathcal{L} \left( \hat{y}^{(i)}, y^{(i)} \right) + \frac {\lambda}{2 \cdot 1000} \sum_{l=1}^{L} \Vert W^{[l]} \Vert_{_F}^2 \quad \left( \text{over the examples in} \ X^{\lbrace t \rbrace} \right) \cr
& \qquad \text{Backward propagation to compute gradients w.r.t} \ J^{\lbrace t \rbrace} \ (\text{using} (X^{\lbrace t \rbrace}, Y^{\lbrace t \rbrace})) \cr
& \qquad \text{For every layer,} \ W^{[l]} = W^{[l]} - \alpha dW^{[l]}, \ b^{[l]} = b^{[l]} - \alpha db^{[l]} \cr
& \rbrace \cr
\end{align}$

Understanding Mini-batch Gradient Descent

Training with mini-batch gradient descent:
image

Choosing the mini-batch size:
image

| | Stochastic gradient descent | Mini-batch gradient descent | Batch gradient descent |
| --- | --- | --- | --- |
| Mini-batch size | $1$ | $\in (1, m)$, not too big and not too small | $m$ |
| Mini-batches | every example is its own mini-batch | … | only one mini-batch: $\left( X^{\lbrace 1 \rbrace} , Y^{\lbrace 1 \rbrace} \right) = (X, Y)$ |
| Trade-off | loses the speedup from vectorization | the practical choice and the fastest learning: makes progress without processing the entire training set | takes too much time per iteration if the training set is very large |
| Guideline | – | $2^n$ is the typical mini-batch size: $64, 128, 256, 512, \cdots$ | use when the training set is small: $m \le 2000$ |

Try a few different powers of two and then see if we can pick one that makes the gradient descent optimization algorithm as efficient as possible.

Make sure the mini-batches, all the $\left( X^{\lbrace t \rbrace} , Y^{\lbrace t \rbrace} \right)$, fit in CPU/GPU memory.
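The shuffle and partition steps above can be sketched in numpy (function name and sizes are illustrative):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    # columns of X and Y are examples; shuffle them synchronously,
    # then slice off consecutive mini-batches (the last may be smaller)
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    return [(X_shuffled[:, k:k + mini_batch_size],
             Y_shuffled[:, k:k + mini_batch_size])
            for k in range(0, m, mini_batch_size)]

X = np.random.randn(5, 148)   # 148 examples -> two batches of 64, one of 20
Y = np.random.randn(1, 148)
batches = random_mini_batches(X, Y, mini_batch_size=64)
```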

Exponentially Weighted Averages

Temperature in London:
$\begin{aligned}
&\theta_1 = 40^{\circ} F \cr
&\theta_2 = 49^{\circ} F \cr
&\theta_3 = 45^{\circ} F \cr
& \quad \vdots \cr
&\theta_{180} = 60^{\circ} F \cr
&\theta_{181} = 56^{\circ} F \cr
& \quad \vdots \cr
\end{aligned}$

To compute the trends, the local average or a moving average of the temperature, here is what we should do:
$\begin{aligned}
\text{Initialize} \ &v_0 = 0 \cr
\text{then, } \ &v_1 = 0.9 v_0 + 0.1 \theta_1 \cr
&v_2 = 0.9 v_1 + 0.1 \theta_2 \cr
&v_3 = 0.9 v_2 + 0.1 \theta_3 \cr
& \vdots \cr
&v_t = 0.9 v_{t-1} + 0.1 \theta_t \cr
\end{aligned}$

image

Exponentially Weighted (Moving) Averages:
$v_t$ as approximately averaging over $\frac 1{1 - \beta}$ days’ temperature.
$$v_t = \beta v_{t-1} + (1 - \beta) \theta_t$$

  • $\beta = 0.5$: $\approx 2$ days’ temperature.
    Averaging over a much shorter window is noisier and much more susceptible to outliers, but it adapts much more quickly when the temperature changes.
  • $\beta = 0.98$: $\approx 50$ days’ temperature.
    Averaging over a larger window makes the exponentially weighted average adapt more slowly when the temperature changes, so there is a bit more latency: when $\beta$ is $0.98$, a lot of weight goes to the previous value and only a little to the current data point.

Understanding Exponentially Weighted Averages

$\begin{aligned}
& \text{For} \ \beta = 0.9 \cr
& v_{100} = 0.9 v_{99} + 0.1 \theta_{100} \cr
& v_{99} = 0.9 v_{98} + 0.1 \theta_{99} \cr
& v_{98} = 0.9 v_{97} + 0.1 \theta_{98} \cr
& \quad \vdots \cr
& v_{100} = 0.1 \ \theta_{100} + 0.1 \times 0.9 \ \theta_{99} + 0.1 \times (0.9)^2 \ \theta_{98} + 0.1 \times (0.9)^3 \ \theta_{97} + \cdots \cr
\end{aligned}$

$\begin{aligned}
\because \ & \lim \limits_{\epsilon \to 0} (1 - \epsilon)^{\frac 1{\epsilon}} = \frac 1e \cr
\therefore \ & 0.9^{10} = (1 - \frac 1{10})^{10} \approx \frac 1e \cr
& 0.98^{50} = (1 - \frac 1{50})^{50} \approx \frac 1e \cr
\end{aligned}$

When $\beta = 0.98$, the weights on past days stay above $\frac 1e$ of the current day’s weight for roughly the first 50 days, and decay quite rapidly after that. Intuitively, this can be regarded as averaging over about 50 days of temperature: substituting $\epsilon = 1 - \beta$, we have $(1 - \epsilon)^{\frac 1{\epsilon}} \approx \frac 1e$, so $\frac 1{1 - \beta}$ tells roughly how many days the average covers.

Implementing exponentially weighted averages
$\begin{aligned}
& v_{\theta} = 0 \cr
& \text{Repeat} \ \lbrace \cr
& \quad \text{Get next} \ \theta_t \cr
& \quad v_{\theta} = \beta v_{\theta} + (1 - \beta) \theta_t \cr
& \rbrace \cr
\end{aligned}$
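The loop above can be implemented directly; a bias-corrected variant (dividing by $1 - \beta^t$, discussed in the next section) is included for comparison:

```python
import numpy as np

def exponentially_weighted_average(thetas, beta=0.9, bias_correction=False):
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

temps = np.array([40.0, 49.0, 45.0])
plain = exponentially_weighted_average(temps)       # v_1 = 0.1 * 40 = 4: too low
corrected = exponentially_weighted_average(temps, bias_correction=True)
```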

Bias Correction in Exponentially Weighted Averages

Bias correction makes the computation of exponentially weighted averages more accurate during the initial phase.

image

The purple curve starts off really low:
$\require{enclose}
\begin{aligned}
\text{Initialize} \ & v_0 = 0 \cr
& v_1 = \enclose{horizontalstrike}{0.98 v_0 + } \ 0.02 \ \theta_1 \cr
& v_2 = 0.98 \times 0.02 \ \theta_1 + 0.02 \ \theta_2 \cr
& \cr
& \because \ \theta_1 = 40^{\circ} F \ \text{and} \ \theta_2 = 49^{\circ} F \cr
& \therefore \ v_1 = 0.8^{\circ} F, \quad v_2 \approx 1.76^{\circ} F \cr
\end{aligned}$

A more accurate technique, which removes the bias during the initial phase:
$$v_t = \frac {v_t}{1 - \beta^t}$$

If $t$ is very large, then the bias correction makes almost no difference.
$\begin{aligned}
\lim \limits_{t \to \infty} \frac {v_t}{1 - \beta^t} = \frac {v_t}{1 - 0} = v_t
\end{aligned}$

This is why when $t$ is large, the purple line and the green line pretty much overlap. The bias correction can help obtain a better estimate of the data (temperature), during this initial phase of learning.

In machine learning, for most implementations of the exponential weighted average, people don’t often bother to implement bias corrections. Because most people would rather wait for initial period and have a slightly more biased estimate and go from there.

Gradient Descent with Momentum

Momentum takes past gradients into account to smooth out the steps of gradient descent. Gradient descent with momentum almost always works faster than the standard gradient descent algorithm.

On the vertical axis, we want the learning to be a bit slower, because we don’t want those oscillations. However, on the horizontal axis, we want faster learning:
image

Momentum
$\begin{aligned}
& \text{On iteration} \ t: \cr
& \quad \text{Compute} \ dW, db \ \text{on the current mini-batch.} \cr
& \quad v_{dW} = \beta \cdot v_{dW} + (1 - \beta) \cdot dW \cr
& \quad v_{db} = \beta \cdot v_{db} + (1 - \beta) \cdot db \cr
& \cr
& \quad W = W - \alpha \ v_{dW}, \quad b = b - \alpha \ v_{db} \cr
\end{aligned}$

After averaging these gradients, the oscillations in the vertical direction, which alternate between positive and negative values, tend to average out to something close to zero. Whereas in the horizontal direction, all the derivatives point the same way, so the average in the horizontal direction remains pretty big.

How do you choose $\beta$?

  • The larger the momentum $\beta$ is, the smoother the update because the more we take the past gradients into account. But if $\beta$ is too big, it could also smooth out the updates too much.
  • Common values for $\beta$ range from 0.8 to 0.999. The most common value for $\beta$ is $0.9$, which averages over roughly the last ten iterations’ gradients.
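A minimal sketch of the momentum update on a single scalar parameter, showing how oscillating gradients average toward zero:

```python
def momentum_step(w, dw, v, alpha=0.1, beta=0.9):
    # velocity v: exponentially weighted average of past gradients
    v = beta * v + (1 - beta) * dw
    return w - alpha * v, v

# oscillating gradients (the "vertical" direction) largely cancel out
w, v = 1.0, 0.0
for dw in [1.0, -1.0, 1.0, -1.0]:
    w, v = momentum_step(w, dw, v)
```

After four alternating gradients of magnitude 1, the velocity has shrunk to under 0.02, which is the damping effect described above.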

An example of a ball rolling downhill from a bowl
Imagine that there is a bowl and a ball rolling downhill from the bowl:
image

The derivative terms $dW, db$ can be regarded as providing acceleration to a ball which is rolling downhill, and the momentum terms $v_{dW}, v_{db}$ can be regarded as representing the velocity.

The derivative imparts acceleration to the ball as it rolls downhill, so the ball rolls faster and faster. And because $\beta < 1$, the momentum term acts a bit like friction, which prevents the ball from speeding up without limit. Rather than gradient descent taking every single step independently of all previous steps, this ball can roll downhill, accelerate, and gain momentum.

RMSprop

Root Mean Square prop (RMSprop) can also speed up gradient descent.
$\begin{aligned}
& \text{On iteration} \ t: \cr
& \quad \text{Compute} \ dW, db \ \text{on the current mini-batch.} \cr
& \quad s_{dW} = \beta_2 \cdot s_{dW} + (1 - \beta_2) \cdot (dW)^2 \ \rightarrow \ \text{element-wise square} \cr
& \quad s_{db} = \beta_2 \cdot s_{db} + (1 - \beta_2) \cdot (db)^2 \cr
& \cr
& \quad W = W - \alpha \frac {dW}{\sqrt{s_{dW}}}, \quad b = b - \alpha \frac {db}{\sqrt{s_{db}}} \cr
\end{aligned}$

image

In the $W$ direction (the horizontal direction), we want learning to go pretty fast, so $s_{dW}$ will be relatively small. In the $b$ direction (the vertical direction), we want to slow down all the oscillations into the vertical direction. $s_{db}$ will be relatively large, so dividing a relatively large number can slow down the updates on a vertical dimension.

In other words, the derivatives are much larger in the vertical direction than in the horizontal direction, so the slope is very large in the $b$ direction, which means the derivatives are a very large $db$ and a relatively small $dW$. Therefore, $(db)^2$ will be relatively large, compared to that $(dW)^2$ will be smaller, and so $s_{dW}$ will be smaller. The updates in the vertical direction are divided by a much larger number, which helps damp out the oscillations. Whereas the updates in the horizontal direction are divided by a smaller number, which speeds up learning.

With RMSprop, it is feasible to use a larger learning rate $\alpha$ and get faster learning without diverging in the vertical direction.

Adam Optimization Algorithm

$\begin{aligned}
& v_{dW} = 0, \ s_{dW} = 0, \ v_{db} = 0, \ s_{db} = 0 \cr
& \text{On iteration} \ t: \cr
& \quad \text{Compute} \ dW, db \ \text{on the current mini-batch.} \cr
& \quad v_{dW} = \beta_1 \cdot v_{dW} + (1 - \beta_1) \cdot dW, \ v_{db} = \beta_1 \cdot v_{db} + (1 - \beta_1) \cdot db \cr
& \quad s_{dW} = \beta_2 \cdot s_{dW} + (1 - \beta_2) \cdot (dW)^2, \ s_{db} = \beta_2 \cdot s_{db} + (1 - \beta_2) \cdot (db)^2 \cr
& \cr
& \quad \text{// Bias correction:} \cr
& \quad v_{dW}^{corrected} = \frac {v_{dW}} {1 - \beta_1^t}, \ v_{db}^{corrected} = \frac {v_{db}} {1 - \beta_1^t} \cr
& \quad s_{dW}^{corrected} = \frac {s_{dW}} {1 - \beta_2^t}, \ s_{db}^{corrected} = \frac {s_{db}} {1 - \beta_2^t} \cr
& \cr
& \quad W = W - \alpha \frac {v_{dW}^{corrected}} {\sqrt { s_{dW}^{corrected} } + \epsilon}, \ b = b - \alpha \frac {v_{db}^{corrected}} {\sqrt { s_{db}^{corrected} } + \epsilon} \cr
\end{aligned}$

Adam (Adaptive Moment Estimation) optimization algorithm combines the effect of gradient descent with momentum together with gradient descent with RMSprop.

Hyperparameter choice:

  • $\alpha$:   needs to be tuned.
  • $\beta_1$:   $0.9$ (default)
  • $\beta_2$:   $0.999$ (default)
  • $\epsilon$:   $10^{-8}$

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum).
  • Usually works well even with little tuning of hyperparameters (except $\alpha$).
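The full update above can be sketched for a single scalar parameter; as a toy check, it is applied here to minimizing $f(w) = w^2$ (gradient $2w$):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw          # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2     # RMSprop term
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# toy check: minimize f(w) = w^2 starting from w = 5
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.05)
```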

Learning Rate Decay

Slowly reducing the learning rate over time might help speed up the learning algorithm.

When iterating with gradient descent, including mini-batch gradient descent, the learning algorithm will tend towards the minimum but won’t converge exactly, because we are using a fixed learning rate $\alpha$ and there’s some noise in the different mini-batches:
image

If we slowly reduce the learning rate $\alpha$, then during the initial phases learning is still relatively fast because $\alpha$ is still large. As $\alpha$ gets smaller, the steps become slower and smaller, so the algorithm ends up oscillating in a tighter region around the minimum rather than wandering far away.

Different learning rate decay methods ($decay\_ rate$ and $\alpha_0$ are hyperparameters):

  • Method 1:
    $\begin{aligned} \alpha = \frac 1{1 + decay\_ rate * epoch} \alpha_0 \end{aligned}$
  • Method 2: exponential decay
    $\begin{aligned} \alpha = 0.95^{(epoch)} \cdot \alpha_0 \end{aligned}$
  • Method 3:
    $\begin{aligned} \alpha = \frac k{\sqrt{epoch}} \cdot \alpha_0 \ \text{or} \ \alpha = \frac k{\sqrt{t}} \cdot \alpha_0, \ k \ \text{is a constant} \end{aligned}$
  • Method 4: discrete staircase
    $\alpha = \begin{cases}
    \frac 12 \alpha_0, & t \in (0, t_0] \cr
    \frac 14 \alpha_0, & t \in (t_0, 2t_0] \cr
    \frac 18 \alpha_0, & t \in (2t_0, 3t_0] \cr
    \quad \vdots \cr
    \frac 1{2^n} \alpha_0, & t \in ((n-1) t_0, n t_0] \cr
    \end{cases}$
  • Method 5: manual decay
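The first three schedules translate directly into code (the function names here are illustrative):

```python
def lr_inverse_decay(alpha0, epoch, decay_rate):
    # Method 1: alpha = alpha0 / (1 + decay_rate * epoch)
    return alpha0 / (1 + decay_rate * epoch)

def lr_exponential_decay(alpha0, epoch, base=0.95):
    # Method 2: alpha = base^epoch * alpha0
    return base ** epoch * alpha0

def lr_sqrt_decay(alpha0, epoch, k=1.0):
    # Method 3: alpha = k / sqrt(epoch) * alpha0
    return k / epoch ** 0.5 * alpha0
```
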

The Problem of Local Optima

It is unlikely to get stuck in bad local optima.
image

Most points of zero gradient are not local optima like those in the left picture. Instead, most points of zero gradient in a cost function are saddle points, shown in the right picture.

With a function in a very high-dimensional space, at a point of zero gradient the function can curve upward (like a convex function) or downward (like a concave function) in each direction. In, say, a $20000$-dimensional space, for a point to be a local optimum the function must curve upward in all $20000$ directions, whose chance of happening is maybe $2^{-20000}$.

Problem of plateaus
Plateaus can make learning pretty slow.
image

In the region of a plateau, the gradient is zero or near zero and the surface is quite flat, so it can take a very long time to slowly find the way off the plateau.

Union-Find

Posted on 2019-04-01

Notes: Deep Learning Specialization - Improving Deep Neural Networks (Week1)

Posted on 2019-03-26

Setting Up the Machine Learning Application

Train / Dev / Test sets

Applied deep learning is a highly iterative process: we have to go around the following circle many times to hopefully find a good choice of network.
image

Hence, how efficiently we can go around the circle can determine how quickly we can make progress.

Splitting training set

  • Projects with $100$ - $10,000$ training examples (ML):
    70% training set / 30% test set, or 60% training set / 20% cross validation set / 20% test set
  • Projects with big data, e.g. $1,000,000$ training examples (DL):
    98% training set / 1% dev set / 1% test set, or 99.5% training set / 0.25% dev set / 0.25% test set.

Cross validation set is also called “dev” (development) set.
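A 98% / 1% / 1% split for a large dataset might be sketched as follows (the `split_dataset` helper is a minimal illustration, not a course utility):

```python
import numpy as np

def split_dataset(X, train_frac=0.98, dev_frac=0.01, seed=0):
    # shuffle the examples (rows), then slice into train / dev / test
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = round(train_frac * len(X))
    n_dev = round(dev_frac * len(X))
    return (X[idx[:n_train]],
            X[idx[n_train:n_train + n_dev]],
            X[idx[n_train + n_dev:]])

X = np.arange(1000).reshape(1000, 1)   # 1000 toy examples
train, dev, test = split_dataset(X)
```

Note that shuffling before slicing only makes sense when all the data comes from the same distribution; the dev and test sets should still come from the distribution you care about.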

Making sure dev and test set come from the same distribution
Mismatched train/test distribution:
Training set: Cat pictures from webpages
Dev/test sets: Cat pictures from users using the app

Not having a test set might be okay (only dev set)
The goal of the test set is to give an unbiased estimate of the performance of the final selected network, but if we do not need an unbiased estimate, it might be okay not to have a test set.

What we should do: train on the training set, try different models, evaluate them on the dev set, and use that to iterate toward a good model.

Bias / Variance

image

|  | high variance | high bias | high bias & high variance | low bias & low variance |
| --- | --- | --- | --- | --- |
| Training set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |

The above analysis is based on the assumption that human-level performance gets nearly 0% error, or more generally, that the optimal error (Bayes error) is nearly 0%.

High bias and high variance
image

Basic Recipe for Machine Learning

image

“Bias variance trade-off”: reducing bias typically increases variance, and reducing variance typically increases bias.
However, in the modern deep learning “Big Data” era, getting a bigger network almost always reduces the bias without necessarily hurting the variance, so long as you regularize appropriately. And getting more data pretty much always reduces the variance without hurting the bias much.
The ability of Deep Learning to train a deeper network and get more data, drives down both bias and variance, without really hurting the other one much.

Regularizing the Neural Network

Regularization

$\begin{align}
& L1 \ \text{regularization:} \quad \Vert w \Vert_1 = \sum_{j=1}^{n_x} \vert w_j \vert \cr
& L2 \ \text{regularization:} \quad \Vert w \Vert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w \cr
\end{align}$

If we use $L1$ regularization, $w$ will end up being sparse, which means there will be a lot of zeros in the $w$ vector. Some people say that this can help with compressing the model.

$L2$ regularization for logistic regression:
$$J(w, b) = \frac 1m \sum_{i=1}^m \mathcal{L} (\hat{y} ^{(i)}, y^{(i)}) + \color{red} {\frac {\lambda}{2m} \Vert w \Vert_{_2}^2}$$

  • $\lambda$ is called the regularization parameter, which is also a hyperparameter.
  • Omit the term $\frac {\lambda}{2m} b^2$, because $b$ is just a single number while $w$ is a very high dimensional parameter vector.

$L2$ regularization for neural network:
$$\begin{aligned}
J(W^{[1]}, b^{[1]}, \cdots, W^{[L]}, b^{[L]}) &= \frac 1m \sum_{i=1}^m \mathcal{L} (\hat{y} ^{(i)}, y^{(i)}) + \color{red} {\frac {\lambda}{2m} \sum_{l=1}^L \Vert W^{[l]} \Vert_{_F}^2} \cr
\Vert W^{[l]} \Vert_{_F} ^2 &= \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} (W_{ij}^{[l]})^2 \cr
\end{aligned}$$

The matrix norm, $\Vert W^{[l]} \Vert_{_F}^2$ is called “Frobenius norm” of the matrix.

Implementing gradient descent with regularization
$\begin{align}
& dW^{[l]} = \frac 1m dZ^{[l]} \cdot [A^{[l-1]}] ^T + \color{red} {\frac {\lambda} m W^{[l]}} \cr
& W^{[l]} = (1 - \frac {\alpha \lambda} m) \cdot W^{[l]} - \alpha \cdot (\frac 1m dZ^{[l]} \cdot [A^{[l-1]}] ^T) \cr
\end{align}$

$L2$ regularization is sometimes called “weight decay” because of $(1 - \frac {\alpha \lambda} m) < 1$.
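A minimal NumPy sketch of the penalty term and the resulting weight-decay update (the helper names are illustrative):

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    # (lambda / 2m) * sum of squared Frobenius norms over all layers
    return lambd / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def weight_decay_update(W, dZ, A_prev, alpha, lambd, m):
    # dW = (1/m) dZ A_prev^T + (lambda/m) W, then a plain gradient step
    dW = dZ @ A_prev.T / m + lambd / m * W
    return W - alpha * dW

# with a zero data gradient, one step shrinks W by exactly (1 - alpha*lambda/m),
# which is where the name "weight decay" comes from
W = np.ones((2, 2))
W_new = weight_decay_update(W, np.zeros((2, 3)), np.zeros((2, 3)),
                            alpha=0.1, lambd=0.5, m=3)
```
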

Why Regularization Reduces Overfitting?

The regularization term $\frac {\lambda}{2m} \sum_{l=1}^{L} \Vert W^{[l]} \Vert_{_F}^2$ is used to penalize the weight matrices $W^{[l]}$ from being too large.

Intuition 1:
image

When setting $\lambda$ to be really large, $W^{[l]} \approx 0$ for a lot of hidden units, so as to simplify the neural network. The simplified neural network will be a much smaller neural network which might transfer the original overfitting case into a high bias case.

Intuition 2:
image

Take the activation function $g(z) = tanh(z)$ as an example. If $z$ is close to zero, the activation function is in its linear region (the red part).
$\lambda \uparrow \quad W^{[l]} \downarrow \quad \Rightarrow \quad z^{[l]} \downarrow \ (z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]})$

If the regularization parameter $\lambda$ is large, the weights $W$ will be penalized to be very small, so $z$ will also be relatively small, which leads to $g(z)$ being roughly linear. If every layer is roughly linear, the whole network behaves like a linear network.

Dropout Regularization

Dropout: set some probabilities of eliminating some nodes in neural network.

To illustrate, for each of these layers, keep each node with probability $p$ and remove it with probability $1 - p$, which ends up with a much smaller, diminished network.

image

For different training examples, randomly keep a different set of nodes and then dropout (eliminate) the other nodes.

Implementing dropout (“Inverted dropout”)
The dropout matrix for the layer $3$: $\ d3$

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # a3 *= d3
a3 /= keep_prob  # the inverted dropout technique
  • $keep\_ prob$ is the probability that a given hidden unit will be kept. Assume $keep\_ prob = 0.8$ in this example, which means there is a 0.2 chance of eliminating any hidden unit.
  • Through the line of code $d3 = np.random.rand(…) < keep\_ prob$, the random matrix $d3$ will be generated with 80% of the elements equal to one and 20% of the elements equal to zero ($np.random.rand$ draws uniformly from $[0, 1)$, so each element is below $keep\_ prob$ with probability 0.8). Technically, $d3$ is a boolean matrix.
  • 20% of the elements of $a^{[3]}$ will be zeroed out. Hence, in order to not reduce the expected value of $z^{[4]}$ where $z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}$, what should be done here is to divide $a^{[3]}$ by the parameter $keep\_ prob$. Specifically, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value.
  • $d3$ for the third layer, is used to decide what to zero out, both in forward propagation and backward propagation.

NOT to use dropout at test time!
Because we don’t want the output to be random when making predictions at the test time. Implementing dropout at the test time will add noise to the predictions.

Understanding Dropout

  • Using a smaller neural network should have a regularizing effect.
  • A unit cannot rely on any one feature, because any one feature could be eliminated randomly. Hence, the unit will not put much weight on any one input and so have to spread out weights, which will tend to have an effect of shrinking the squared norm of the weights.

It is feasible to vary $keep\_ prob$ by layer.
image

Set $keep\_ prob$ to be a relatively small number for a layer we worry about overfitting. For those we don't worry about at all, set $keep\_ prob$ to $1.0$.

Technically, we can also apply dropout to the input layer. In practice, it is quite common to set its $keep\_ prob$ to $1.0$ or a number close to one.

In computer vision, the input size is very large, and there is almost never enough data relative to all the input pixels. Not having enough data almost always leads to overfitting, so dropout is quite frequently used in computer vision.

A downside of dropout: the cost function $J$ is not well-defined.
It is harder to check that the performance of gradient descent is going downhill on every iteration.
The solution to this problem is to turn off dropout by setting $keep\_ prob$ equal to one, make sure $J$ is monotonically decreasing, and then turn dropout back on.

Other Regularization Methods

Data augmentation:
image

Early stopping: stop training the neural network halfway
image

  • Advantage: running the gradient descent process just once, we can get to try out values of small $W$, mid-size $W$ and large $W$, without needing to try a lot of values of $L2$ regularization parameter $\lambda$.
  • Disadvantage: break the process of optimizing cost function $J$.

Setting Up the Optimization Problem

Normalizing Inputs

A technique to speed up training.

  • Subtract mean:
    $\begin{align}
    & \mu = \frac 1m \sum_{i=1}^m x^{(i)} \cr
    & x = x - \mu \cr
    \end{align}$
  • Normalize variance:
    $\begin{align}
    & \sigma^2 = \frac 1m \sum_{i=1}^m (x^{(i)}) ^2 \ \ \text{(element-wise square, after subtracting the mean)} \cr
    & x = x / \sigma \cr
    \end{align}$

image
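Both steps can be sketched with features as rows and examples as columns, as in these notes (remember to reuse the same $\mu$ and $\sigma$ to normalize the test set):

```python
import numpy as np

def normalize_inputs(X):
    # X has shape (n_features, m_examples)
    mu = np.mean(X, axis=1, keepdims=True)
    X = X - mu                                        # subtract mean
    sigma = np.sqrt(np.mean(X ** 2, axis=1, keepdims=True))
    return X / sigma, mu, sigma                       # divide by std deviation

X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
X_norm, mu, sigma = normalize_inputs(X)
```
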

Why normalize inputs?
It is easier to optimize the cost function $J$ when features are all on similar scales. Also, please refer to Notes: Machine Learning by Andrew Ng (Week2).
image

Vanishing / Exploding Gradients

When training a very deep neural network, the derivatives can sometimes get exponentially big or exponentially small, which makes training difficult!

In a very deep network, assume that its activation function is $g(z) = z$ and its bias is $b^{[l]} = 0$.
$\therefore \ \hat{y} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[3]} W^{[2]} W^{[1]} x$

  • Assume $W^{[l]} = \begin{bmatrix} 1.5 & 0 \cr 0 & 1.5 \cr \end{bmatrix}$,
    Then, the activation values increase exponentially and the value of $\hat{y}$ will explode, $\ \because \ \hat{y} = \begin{bmatrix} 1.5^L & 0 \cr 0 & 1.5^L \cr \end{bmatrix} x$
  • Assume $W^{[l]} = \begin{bmatrix} 0.5 & 0 \cr 0 & 0.5 \cr \end{bmatrix}$,
    Then, the activation values decrease exponentially and the value of $\hat{y}$ will be exponentially small, $\ \because \ \hat{y} = \begin{bmatrix} 0.5^L & 0 \cr 0 & 0.5^L \cr \end{bmatrix} x$

Weight Initialization for Deep Networks

A partial solution to solve the vanishing / exploding gradients:
A more careful choice of random initialization.

Set the variance of $W$ to be equal to $\frac 1n$:

  • relu
    $W^{[l]} = np.random.randn(\text{SHAPE}) * np.sqrt(\frac 2{n^{[l-1]}})$
    Use $n^{[l-1]}$ here because there are $n^{[l-1]}$ inputs for each of the units in the layer $l$.
  • tanh
    $W^{[l]} = np.random.randn(\text{SHAPE}) * np.sqrt(\frac 1{n^{[l-1]}}) \ $ (Xavier initialization)

If the input features or activations have a mean roughly equal to $0$ and a standard deviation roughly equal to $1$, this will cause $z$ to take on a similar scale, which helps reduce the vanishing / exploding gradients problem: the weight matrices $W$ will be neither much bigger nor much smaller than one.
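Both scalings can be sketched together (the `initialize_weights` helper is illustrative):

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu", seed=1):
    # He scaling sqrt(2/n) for ReLU layers, Xavier scaling sqrt(1/n) for tanh
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]              # n^[l-1] inputs per unit in layer l
        scale = np.sqrt(2.0 / fan_in) if activation == "relu" else np.sqrt(1.0 / fan_in)
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], fan_in)) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_weights([3, 4, 1])          # n^[0]=3, n^[1]=4, n^[2]=1
```
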

Numerical Approximation of Gradients

Checking the derivative computation:
Make sure $\frac {f(\theta + \epsilon) - f(\theta - \epsilon)}{2 \epsilon} \approx g(\theta)$.

We can verify whether or not a function $g(\theta)$ is a correct implementation of the derivative of a function $f$.
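For example, checking the derivative of $f(\theta) = \theta^3$ at $\theta = 2$, where the true derivative is $3\theta^2 = 12$:

```python
def two_sided_difference(f, theta, eps=1e-7):
    # symmetric difference quotient; its error is O(eps^2),
    # much better than the one-sided O(eps) approximation
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

approx = two_sided_difference(lambda t: t ** 3, 2.0)
```
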

Gradient Checking

Gradient checking is a technique that helps debug in the implementations of backpropagation.

Preprocessing:

  • First, reshape $W^{[1]}, \cdots , W^{[L]}$ into vectors and concatenate all the new $W^{[1]}, \cdots , W^{[L]}$ with the original $b^{[1]}, \cdots , b^{[L]}$ to form a big vector $\theta$.
  • Second, likewise, take $dW^{[1]}, db^{[1]}, \cdots , dW^{[L]}, db^{[L]}$ and reshape into a big vector $d\theta$.

image

Gradient checking (Grad check)
$\begin{align}
& J(\theta) = J(\theta_1, \theta_2, \theta_3, \cdots) \cr
& \cr
& \text{for each} \ i \ (\text{component of } J(\theta)): \cr
& \qquad d\theta_{approx} [i] = \frac {J(\theta_1, \theta_2, \cdots, \theta_i + \epsilon, \cdots) - J(\theta_1, \theta_2, \cdots, \theta_i - \epsilon, \cdots)} {2 \epsilon} \cr
& \qquad \qquad \qquad \approx d\theta [i] = \frac {\partial J} {\partial \theta_i} \cr
& \cr
& \qquad \text{to decide whether the two vectors are really close to each other,} \cr
& \qquad \text{check} \ \ \frac {\Vert d\theta_{approx} - d\theta \Vert_2} {\Vert d\theta_{approx} \Vert_2 + \Vert d\theta \Vert_2} \le 10^{-7} \ \text{(great!)} \cr
\end{align}$

If the value of the above fraction $> 10^{-3}$, it is highly probable that there is a bug.
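The check itself is a few lines of NumPy (the helper name is illustrative):

```python
import numpy as np

def grad_check_ratio(dtheta_approx, dtheta):
    # relative distance between the numerical and analytic gradient vectors
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator

good = grad_check_ratio(np.array([1.0, 2.0]), np.array([1.0, 2.0]))  # identical gradients
bad = grad_check_ratio(np.array([1.0, 2.0]), np.array([1.0, -2.0]))  # sign bug in one component
```

Identical vectors give a ratio of $0$; a sign flip in one component pushes the ratio far above $10^{-3}$, flagging a likely bug.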

Gradient Checking Implementation Notes

  • Don’t use in training, only use it to debug.
    Computing $d\theta_{approx} [i]$ for all the values of $i$ is a very slow computation. Hence, do not run gradient checking at every iteration during training.
  • If algorithm fails grad check, look at components to try to identify bug.
    Search in the different values of $i$ to see whose values of $d\theta_{approx} [i]$ that are very different from values of $d\theta [i]$.
  • Remember Regularization.
  • Grad check doesn’t work with dropout. Run the gradient check algorithm without dropout to make sure the backpropagation is correct, then add dropout.
  • Run at random initialization; perhaps again after some training.

Practical Aspects of Deep Learning

With the inverted dropout technique, at test time:
We do not apply dropout (do not randomly eliminate units) and do not keep the $(1 / keep\_ prob)$ factor in the calculations used in training.

Programming Assignment: Initialization

Zero initialization:

  • The performance is really bad, and the cost does not really decrease, and the algorithm performs no better than random guessing.
  • In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, which is similar to training a neural network with $n^{[l]}=1$ for every layer; the network is no more powerful than a linear classifier such as logistic regression.
  • It is okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.

Random initialization with large values:

  • The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when $\log(a^{[3]}) = \log(0)$, the loss goes to infinity.
  • Poor initialization can lead to vanishing, exploding gradients, which also slows down the optimization algorithm.
  • Initializing with overly large random numbers slows down the optimization.

He initialization:

  • He initialization would use $sqrt(2 ./ layers\_ dims[l-1])$.
  • He initialization works well for networks with ReLU activations.

Programming Assignment: Regularization

Backpropagation with dropout:

  • Shut down the same neurons as those shut down during forward propagation, by reapplying the same mask $D^{[l]}$ to $dA^{[l]}$.
  • Then, divide $dA^{[l]}$ by $keep\_ prob$ again (the calculus interpretation is that if $A^{[l]}$ is scaled by $keep\_ prob$, then its derivative $dA^{[l]}$ is also scaled by the same $keep\_ prob$).
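A sketch of the forward and backward passes for one dropout layer, reusing the same mask (variable names follow the notes; the random data is for illustration only):

```python
import numpy as np

keep_prob = 0.8
rng = np.random.default_rng(0)
A3 = rng.standard_normal((5, 10))    # activations of layer 3
dA3 = rng.standard_normal((5, 10))   # upstream gradient w.r.t. A3

# forward: build the mask once, apply it, and rescale (inverted dropout)
D3 = rng.random(A3.shape) < keep_prob
A3_drop = A3 * D3 / keep_prob

# backward: reuse the SAME mask D3 and the same rescaling on the gradient
dA3_drop = dA3 * D3 / keep_prob
```
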

Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set.

Notes: Deep Learning Specialization - Neural Networks and Deep Learning (Week4)

Posted on 2019-03-10

Deep Neural Network

Deep L-layer Neural Network

Notation:
$\begin{align}
& L = \# \text{ layers} \cr
& n^{[l]} = \# \text{units in layer} \ l \cr
& a^{[l]} = \text{activations in layer} \ l \cr
& g^{[l]} = \text{activation functions in layer} \ l \cr
\end{align}$

Forward Propagation in a Deep Network

$\begin{align}
& Z^{[l]} = W^{[l]} A^{[l -1]} + b^{[l]} \cr
& A^{[l]} = g^{[l]} (Z^{[l]})
\end{align}$

Getting the Matrix Dimensions Right

The sizes of parameters $W^{[l]}$ and $b^{[l]}$:
$\begin{align}
& W^{[l]}, \ dW^{[l]}: \ \ \ ( {n^{[l]}, \ n^{[l - 1]}} ) \cr
& b^{[l]}, \ db^{[l]}: \ \ \ (n^{[l]}, \ 1 ) \cr
& z^{[l]}, \ a^{[l]}: \ \ \ (n^{[l]}, \ 1 ) \cr
& Z^{[l]}, \ A^{[l]}: \ \ \ (n^{[l]}, \ m ) \cr
\end{align}$
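A forward pass that asserts these dimensions at every layer might look like this (ReLU hidden activations and the toy parameter values are assumptions for the example):

```python
import numpy as np

def forward(X, params, L):
    # X: (n^[0], m); checks that every Z^[l] and A^[l] has shape (n^[l], m)
    A, m = X, X.shape[1]
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b
        assert Z.shape == (W.shape[0], m)   # (n^[l], m), as in the table above
        A = np.maximum(0, Z)                # g^[l] = ReLU (assumed)
    return A

# toy network: n^[0]=3, n^[1]=4, n^[2]=2, batch of m=5 examples
params = {"W1": np.full((4, 3), 0.1), "b1": np.zeros((4, 1)),
          "W2": np.full((2, 4), 0.1), "b2": np.zeros((2, 1))}
A_out = forward(np.ones((3, 5)), params, L=2)
```
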

Why Deep Representations?

Intuition about deep representation
image

Take a neural network with simple-to-complex hierarchy as an example:

  • In the 1st hidden layer, those little square boxes in the left visualize the 20 hidden units, which try to figure out where the edges of the orientation are in the image.
  • In the 2nd hidden layer, the hidden units can detect the edges and group edges together to form parts of faces. Hence, there might exist a neuron trying to find an eye and another neuron trying to find a nose.
  • In the 3rd hidden layer, it can start to recognize different types of faces.

Circuit theory and deep learning
There are functions computed with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

For instance, compute $y = x_1 \ \text{XOR} \ x_2 \ \text{XOR} \ x_3 \ \text{XOR} \ \cdots \ \text{XOR} \ x_n$
image

  • Building an XOR tree:
    On the left, it computes the XOR of $x_1$ and $x_2$, then takes $x_3$ and $x_4$ and computes their XOR, and so on. In this way, the depth of the tree is $O(\log n)$, so we only need $O(\log n)$ layers to compute the XOR function using AND, OR and NOT gates, which means the number of nodes or circuit components (gates) is not large.
  • Only one hidden layer:
    On the right, in order to compute the XOR function with only one hidden layer, the hidden layer must be exponentially large, because it must enumerate all $2^n$ possible configurations of the input bits ($0$ or $1$).

Building Blocks of Deep Neural Networks

image

Forward and Backward Propagation

Forward propagation for layer $l$:
$\begin{align}
& \text{Input} \ a^{[l - 1]} \cr
& \text{Output} \ a^{[l]}, \ \text{cache} (z^{[l]}, \ W^{[l]}, \ b^{[l]}) \cr
& \cr
& \color{red}{Z^{[l]} = W^{[l]} A^{[l -1]} + b^{[l]}} \cr
& \color{red}{A^{[l]} = g^{[l]} (Z^{[l]})} \cr
\end{align}$

Backward propagation for layer $l$:
$\begin{align}
& \text{Input} \ da^{[l]} \cr
& \text{Output} \ da^{[l - 1]}, \ dW^{[l]}, \ db^{[l]} \cr
& \cr
& \color{red} { dZ^{[l]} = dA^{[l]} * g^{[l] \prime} (Z^{[l]}) } \cr
& \color{red} { dW^{[l]} = \frac 1m dZ^{[l]} \cdot [A^{[l-1]}] ^T } \cr
& \color{red} { db^{[l]} = \frac 1m np.sum( dZ^{[l]}, \ axis=1, \ keepdims=True ) } \cr
& \color{red} { dA^{[l - 1]} = [W^{[l]}] ^T \cdot dZ^{[l]} } \cr
\end{align}$

The backward propagation is initialized with $da^{[L]} = -\frac ya + \frac {1- y}{1-a}$ when using logistic regression to do binary classification.
Vectorized initialization (the $m$ per-example terms are stacked as columns, not summed):
$dA^{[L]} = \left( -\frac {y^{(1)}} {a^{(1)}} + \frac {1 - y^{(1)}} {1 - a^{(1)}}, \ \cdots, \ -\frac {y^{(m)}} {a^{(m)}} + \frac {1 - y^{(m)}} {1 - a^{(m)}} \right)$

Parameters VS Hyperparameters

  • Parameters:
    $W^{[1]}, \ b^{[1]}, \ W^{[2]}, \ b^{[2]}, \ W^{[3]}, \ b^{[3]}, \cdots$
  • Hyperparameters: determine the final value of the parameters $W$ and $b$.
    $\begin{align}
    & \text{learning rate} \ \alpha \cr
    & \# \text{ iterations} \cr
    & \# \text{ hidden layer} \ L \cr
    & \# \text{ hidden units} \ n^{[1]}, \ n^{[2]}, \cdots \cr
    & \text{choice of activation functions} \cr
    & \text{momentum} \cr
    & \text{minibatch size} \cr
    & \text{regularization parameter} \cr
    \end{align}$

Applied deep learning is a very empirical process!

Notes: Deep Learning Specialization - Neural Networks and Deep Learning (Week3)

Posted on 2019-03-05

Shallow Neural Network

Neural Network Representation

Notation $a_i^{[l]}$: the superscript $[l]$ denotes the layer, and the subscript $i$ denotes the node within that layer.

Computing a Neural Network’s Output

Two steps of computation in Logistic Regression, represented by the circle in the following:
image

In Logistic Regression, compute $z = w^T x + b$, and then, compute the activation as a sigmoid function of z by $a = \sigma (z)$. A neural network just does this a lot more times.
image

The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.

In each neuron of the hidden layer, do the two-step computation:
$\begin{align}
& z_1^{[1]} = [w_1^{[1]}] ^T x + b_1^{[1]}, \quad a_1^{[1]} = \sigma (z_1^{[1]}) \newline
& z_2^{[1]} = [w_2^{[1]}] ^T x + b_2^{[1]}, \quad a_2^{[1]} = \sigma (z_2^{[1]}) \newline
& z_3^{[1]} = [w_3^{[1]}] ^T x + b_3^{[1]}, \quad a_3^{[1]} = \sigma (z_3^{[1]}) \newline
& z_4^{[1]} = [w_4^{[1]}] ^T x + b_4^{[1]}, \quad a_4^{[1]} = \sigma (z_4^{[1]}) \newline
& \newline
& a^{[1]} = \begin{bmatrix} a_1^{[1]} \newline a_2^{[1]} \newline a_3^{[1]} \newline a_4^{[1]} \end{bmatrix} \newline
\end{align}$

It is really inefficient to implement a neural network by doing computation with iteration. Hence, choose to vectorize:
$$z^{[1]} =
\begin{bmatrix}
— \ [w_1^{[1]}] ^T \ — \newline
— \ [w_2^{[1]}] ^T \ — \newline
— \ [w_3^{[1]}] ^T \ — \newline
— \ [w_4^{[1]}] ^T \ — \newline
\end{bmatrix} \cdot
\begin{bmatrix}
x_1 \newline x_2 \newline x_3
\end{bmatrix} +
\begin{bmatrix}
b_1^{[1]} \newline b_2^{[1]} \newline b_3^{[1]} \newline b_4^{[1]}
\end{bmatrix} =
\begin{bmatrix}
[w_1^{[1]}] ^T x + b_1^{[1]} \newline
[w_2^{[1]}] ^T x + b_2^{[1]} \newline
[w_3^{[1]}] ^T x + b_3^{[1]} \newline
[w_4^{[1]}] ^T x + b_4^{[1]} \newline
\end{bmatrix} =
\begin{bmatrix}
z_1^{[1]} \newline z_2^{[1]} \newline z_3^{[1]} \newline z_4^{[1]}
\end{bmatrix}
$$

Therefore, given input $x$:
$\begin{align}
& z^{[1]} = W^{[1]} x + b^{[1]} &\qquad &\color{blue}{\small{(4,1) = (4,3) \cdot (3,1) + (4,1)}} \newline
& a^{[1]} = \sigma (z^{[1]}) &\qquad &\color{blue}{\small{(4,1) = (4,1)}} \newline
& z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} &\qquad &\color{blue}{\small{(1,1) = (1,4) \cdot (4,1) + (1,1)}} \newline
& a^{[2]} = \sigma (z^{[2]}) &\qquad &\color{blue}{\small{(1,1) = (1,1)}} \newline
\end{align}$

Vectorizing Across Multiple Examples

An unvectorized implementation of computing all the outputs on the $m$ training examples:
$\begin{align}
& \text{for} \ i = 1: m \newline
& \quad z^{[1] (i)} = W^{[1]} x^{(i)} + b^{[1]} \newline
& \quad a^{[1] (i)} = \sigma (z^{[1] (i)}) \newline
& \quad z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2]} \newline
& \quad a^{[2] (i)} = \sigma (z^{[2] (i)}) \newline
& \text{end for} \newline
\end{align}$

To illustrate, $\ \large a^{[\text{layer} \ j] \ (\text{example} \ i)}$.

Vectorize across multiple examples
$\color{red}{\begin{align}
& Forward \ propagation: \newline
& \quad Z^{[1]} = W^{[1]} X + b^{[1]} \newline
& \quad A^{[1]} = \sigma (Z^{[1]}) \newline
& \quad Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \newline
& \quad A^{[2]} = \sigma (Z^{[2]}) \newline
\end{align}}$

In detail, take $Z^{[1]}$ and $A^{[1]}$ as examples:
$Z^{[1]} = \begin{bmatrix} \vert & \vert & & \vert \newline z^{[1] (1)} & z^{[1] (2)} & \cdots & z^{[1] (m)} \newline \vert & \vert & & \vert \end{bmatrix}, \quad A^{[1]} = \begin{bmatrix} \vert & \vert & & \vert \newline a^{[1] (1)} & a^{[1] (2)} & \cdots & a^{[1] (m)} \newline \vert & \vert & & \vert \end{bmatrix}$

  • Horizontally, from left to right, go through the $m$ training examples, so the horizontal index corresponds to different training examples.
  • Vertically, from up to down, the vertical index corresponds to different hidden units in the neural network.

Explanation for Vectorized Implementation

Assume there are only $3$ training examples, and $b^{[1]} = 0$.
Therefore, $z^{[1] (1)} = W^{[1]} x^{(1)}, \ \ z^{[1] (2)} = W^{[1]} x^{(2)}, \ \ z^{[1] (3)} = W^{[1]} x^{(3)}$.
$\because W^{[1]} \ $ is a matrix
$\therefore W^{[1]} \color{purple}{x^{(1)} = \begin{bmatrix}
\LARGE \cdot \newline
\LARGE \cdot \newline
\LARGE \cdot \newline
\LARGE\cdot \newline
\end{bmatrix}}, \quad
W^{[1]} \color{green}{x^{(2)} = \begin{bmatrix}
\LARGE \cdot \newline
\LARGE \cdot \newline
\LARGE \cdot \newline
\LARGE\cdot \newline
\end{bmatrix}}, \quad
W^{[1]} \color{orange}{x^{(3)} = \begin{bmatrix}
\LARGE \cdot \newline
\LARGE \cdot \newline
\LARGE \cdot \newline
\LARGE\cdot \newline
\end{bmatrix}}$

$\therefore W^{[1]} X = W^{[1]} \cdot
\begin{bmatrix}
\color{purple}{\vert} & \color{green}{\vert} & \color{orange}{\vert} \newline
\color{purple}{x^{(1)}} & \color{green}{x^{(2)}} & \color{orange}{x^{(3)}} \newline
\color{purple}{\vert} & \color{green}{\vert} & \color{orange}{\vert}
\end{bmatrix} =
\begin{bmatrix}
\LARGE \color{purple}{\cdot} & \LARGE \color{green}{\cdot} & \LARGE \color{orange}{\cdot} \newline
\LARGE \color{purple}{\cdot} & \LARGE \color{green}{\cdot} & \LARGE \color{orange}{\cdot} \newline
\LARGE \color{purple}{\cdot} & \LARGE \color{green}{\cdot} & \LARGE \color{orange}{\cdot} \newline
\LARGE \color{purple}{\cdot} & \LARGE \color{green}{\cdot} & \LARGE \color{orange}{\cdot} \newline
\end{bmatrix} =
\begin{bmatrix}
\color{purple}{\vert} & \color{green}{\vert} & \color{orange}{\vert} \newline
\color{purple}{z^{[1] (1)}} & \color{green}{z^{[1] (2)}} & \color{orange}{z^{[1] (3)}} \newline
\color{purple}{\vert} & \color{green}{\vert} & \color{orange}{\vert}
\end{bmatrix} = Z^{[1]}$

Intuitively, the vector $\begin{bmatrix} \LARGE \cdot \newline \LARGE \cdot \newline \LARGE \cdot \newline \LARGE\cdot \newline \end{bmatrix}$ can be regarded as a computational representation of the nodes in the hidden layer, shown in the following picture. Hence, $m$ training examples will produce $m$ such vectors, which can be combined into a matrix with the size of $(NODES\_ OF\_ HIDDEN\_ LAYER, \ m)$.
image

Activation Functions and Their Derivatives

  • sigmoid:
    $\begin{align}
    & a = \frac 1{1 + e^{-z}} \cr
    & a’ = \frac {e^{-z}}{(1 + e^{-z}) ^2} = \frac {e^{-z}}{1 + e^{-z}} \cdot \frac 1{1 + e^{-z}} = \frac {e^{-z}}{1 + e^{-z}} \cdot (1 - \frac {e^{-z}} {1 + e^{-z}}) = a (1 - a) \cr
    \end{align}$
    image
  • tanh:
    $\begin{align}
    a = \frac {e^z - e^{-z}} {e^z + e^{-z}} \ \ \ \ \ & \cr
    a ’ = \frac 4 {(e^z + e^{-z}) ^2}
    & = \frac {(e^z + e^{-z}) ^2 - (e^z - e^{-z}) ^2} {(e^z + e^{-z}) ^2} \cr
    & = \frac {(e^z + e^{-z}) ^2} {(e^z + e^{-z}) ^2} - \frac {(e^z - e^{-z}) ^2} {(e^z + e^{-z}) ^2} \cr
    & = 1 - ( \frac {e^z - e^{-z}} {e^z + e^{-z}} ) ^2 \cr
    & = 1 - (tanh(z)) ^2 \cr
    & = 1 - a^2 \cr
    \end{align}$
    image
  • ReLU:
    $\begin{align}
    & a = max(0, z) \newline
    & a’ = \begin{cases}
    0 & \text{if} \ z < 0 \newline
    1 & \text{if} \ z > 0 \newline
    undefined & \text{if} \ z = 0 \newline
    \end{cases}
    \end{align}$
    About the “undefined”, if we implement the $ReLU$ function in software, it’ll work just fine to set the derivative to be equal to 1 when $z$ is exactly zero.
    image

  • Leaky ReLU:
    $\begin{align}
    & a = max(0.01z, z) \newline
    & a’ = \begin{cases}
    0.01 & \text{if} \ z < 0 \newline
    1 & \text{if} \ z \ge 0 \newline
    \end{cases}
    \end{align}$
    image

Pros and cons of activation functions:

  • Almost never use the sigmoid function, except for the output layer when doing binary classification.
  • The tanh function is almost always superior to the sigmoid function, because the mean of its activations is closer to $0$. Centering the data so that its mean is closer to $0$ rather than $0.5$ makes learning for the next layer a little bit easier.
  • One of the downsides of both the sigmoid function and the tanh function is that if z is either very large or very small, then the derivative of this function becomes very small (close to $0$), which slows down gradient descent.
  • The leaky ReLU usually works better than the ReLU activation function.
  • The advantage of both the ReLU and the leaky ReLU is that, using the ReLU activation function, the neural network will often learn much faster than using the tanh or the sigmoid activation function.

Why Needing Non-linear Activation Functions?

Assume the activation function is a linear activation function $g(z) = z$.
Given $x$:
$\begin{align}
a^{[1]} = z^{[1]} &= W^{[1]} x + b^{[1]} \newline
a^{[2]} = z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \newline
& = W^{[2]} (W^{[1]} x + b^{[1]}) + b^{[2]} \newline
& = (W^{[2]} W^{[1]}) x + (W^{[2]} b^{[1]} + b^{[2]}) \newline
& = W’ x + b’ \newline
\end{align}$

The result of using linear activation function or identity activation function is that the neural network just outputs a linear function of the input, no matter how many layers the neural network has.
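This collapse can be verified numerically (the random weights and input are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))

# two layers with the identity activation g(z) = z
a2 = W2 @ (W1 @ x + b1) + b2

# the equivalent single linear layer W' x + b'
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
a2_collapsed = W_prime @ x + b_prime
```
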

Gradient Descent for Neural Networks

Notation:
$n^{[0]} = n_x$: the number of input features
$n^{[1]}$: the number of hidden units
$n^{[2]}$: the number of output units
$g^{[1]}$: the activation function of the (first) hidden layer
$g^{[2]}$: the activation function of the output layer

Parameters:
$W^{[1]}$ with the size of $(n^{[1]}, \ n^{[0]})$
$b^{[1]}$ with the size of $(n^{[1]}, \ 1)$
$W^{[2]}$ with the size of $(n^{[2]}, \ n^{[1]}), n^{[2]} = 1$
$b^{[2]}$ with the size of $(n^{[2]}, \ 1)$

Formulas for computing derivatives
$\color{red}{\begin{align}
& Backpropagation: \newline
& \quad dZ^{[2]} = A^{[2]} - Y \newline
& \quad dW^{[2]} = \frac 1m dZ^{[2]} \ [A^{[1]}] ^T \newline
& \quad db^{[2]} = \frac 1m np.sum(dZ^{[2]}, \ axis=1, \ keepdims=True) \newline
& \quad dZ^{[1]} = [W^{[2]}] ^T \ dZ^{[2]} * \left( g^{[1]} (Z^{[1]}) \right) ^{\prime} \newline
& \quad dW^{[1]} = \frac 1m dZ^{[1]} X^T \newline
& \quad db^{[1]} = \frac 1m np.sum(dZ^{[1]}, \ axis=1, \ keepdims=True) \newline
\end{align}}$

Setting $keepdims=True$ prevents Python from outputting a “rank 1 array” with shape $(n, )$; it ensures that the output is a vector with the size of $(n, 1)$.
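The six formulas above can be sketched as one function (a tanh hidden layer is assumed, so $g^{[1] \prime}(Z^{[1]}) = 1 - [A^{[1]}]^2$; the function signature is illustrative):

```python
import numpy as np

def backward(X, Y, A1, A2, W2):
    # backprop for the 2-layer network; hidden activation g^[1] = tanh
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # g^[1]'(Z^[1]) for tanh
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

# shape check on a toy batch: n^[0]=2, n^[1]=4, n^[2]=1, m=3
rng = np.random.default_rng(0)
dW1, db1, dW2, db2 = backward(rng.standard_normal((2, 3)),
                              rng.standard_normal((1, 3)),
                              np.tanh(rng.standard_normal((4, 3))),
                              rng.standard_normal((1, 3)),
                              rng.standard_normal((1, 4)))
```

Note how each gradient has the same shape as the parameter it corresponds to, matching the dimension rule stated later in the notes.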

Backpropagation Intuition

Neural network gradients:
image

The computation of $dW^{[2]}$ and $db^{[2]}$ is same as the one of Logistic Regression. Then, here is a detailed explanation of how to compute $dz^{[1]}$:
$\begin{align}
dz^{[1]} &= \frac {d \mathcal{L}}{dz^{[1]}} \cr
&= \frac {\partial \mathcal{L}} {\partial a^{[2]}} \cdot \frac {da^{[2]}} {dz^{[2]}} \cdot \frac {dz^{[2]}} {dz^{[1]}} \cr
&= ( \frac {\partial \mathcal{L}} {\partial a^{[2]}} \cdot \frac {da^{[2]}} {dz^{[2]}} ) \cdot \frac {dz^{[2]}} {da^{[1]}} \cdot \frac {da^{[1]}} {dz^{[1]}} \cr
&= dz^{[2]} \cdot \frac d {da^{[1]}} (W^{[2]} a^{[1]} + b^{[2]}) * g^{\prime} (z^{[1]}) \cr
&= dz^{[2]} \cdot [W^{[2]}] ^T * g^{\prime} (z^{[1]}) \cr
\end{align}$

Note that for any variable, $var$ and $dvar$ always have the same dimensions.

Random Initialization

It’s important to initialize the weights randomly. For logistic regression, it is OK to initialize the weights to zero. However, for a neural network, initializing the weights to all zero will result in problems.

The result of initializing weights to zero is that, all hidden units compute exactly the same function and output the same value, no matter how many iterations are used to train the neural network. Hence, there’s really no point to having more than one hidden unit, because they all compute the same thing.

The solution to the above-mentioned problem is to initialize weights randomly:
$\begin{align}
& W^{[1]} = np.random.randn(2, 2) * 0.01 \cr
& b^{[1]} = np.zeros((2, 1)) \cr
& \cdots \cr
\end{align}$

The reason for multiplying by $0.01$:
If $W$ is too large, it is more likely to end up with very large values of $z$, which cause the tanh or the sigmoid activation function to saturate where the gradient is close to zero, thus slowing down learning. If few sigmoid or tanh activation functions are used in the neural network, this is less of an issue.
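A quick check of how tanh's gradient $g^{\prime}(z) = 1 - \tanh^2(z)$ vanishes for large $z$ (values illustrative):

```python
import numpy as np

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

# Small z (small weights): gradient near 1, learning proceeds
# Large z (large weights): gradient nearly 0, learning stalls
assert tanh_grad(0.01) > 0.99
assert tanh_grad(10.0) < 1e-8
```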

General Methodology to Build a Neural Network

  1. Define the neural network structure ( # of input units, # of hidden units, etc).
  2. Initialize the model’s parameters
  3. Loop:
    • Implement forward propagation
    • Compute loss
    • Implement backward propagation to get the gradients
    • Update parameters (gradient descent)
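The four steps can be sketched end to end on a toy problem; the XOR data, hidden-layer size, and hyperparameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# 1. Define the structure: toy XOR data, 2 inputs, 4 hidden units, 1 output
X = np.array([[0., 0., 1., 1.],
              [0., 1., 0., 1.]])
Y = np.array([[0., 1., 1., 0.]])
n_x, n_1, m = 2, 4, X.shape[1]

# 2. Initialize parameters (small random weights, zero biases)
W1 = rng.standard_normal((n_1, n_x)) * 0.5
b1 = np.zeros((n_1, 1))
W2 = rng.standard_normal((1, n_1)) * 0.5
b2 = np.zeros((1, 1))

alpha, costs = 0.5, []
for _ in range(1000):
    # 3a. Forward propagation (tanh hidden layer, sigmoid output)
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    # 3b. Cost (cross-entropy)
    costs.append(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)
    # 3c. Backward propagation
    dZ2 = A2 - Y
    dW2, db2 = dZ2 @ A1.T / m, dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = W2.T @ dZ2 * (1 - A1 ** 2)
    dW1, db1 = dZ1 @ X.T / m, dZ1.sum(axis=1, keepdims=True) / m
    # 3d. Update parameters (gradient descent)
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

assert costs[-1] < costs[0]   # training reduced the cost
```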

Notes: Deep Learning Specialization - Neural Networks and Deep Learning (Week1, 2)

Posted on 2019-03-02

Introduction to Deep Learning

What Is a Neural Network?

A neural network is a network or circuit of artificial neurons. An artificial neuron is a mathematical function conceived as a model of biological neurons. It receives one or more inputs and sums them to produce an output (or activation, representing a neuron’s action potential which is transmitted along its axon).

Supervised Learning with Neural Networks

For more information about Supervised Learning, click on the link or refer to this blog: Notes: Machine Learning by Andrew Ng (Week1).

Structured Data VS Unstructured Data:
image

Why Is Deep Learning Taking Off?

image

Scale drives deep learning progress:

  • Data
  • Computation
  • Algorithm

Logistic Regression as a Neural Network

Binary Classification

Binary classification is the task of classifying the elements of a given set into two groups.

Notation

  • $(x, y)$ is a single training example
  • $x \in \mathbb{R}^{n_x}$ is an $n_x$-feature vector
  • $y \in \lbrace 0, 1 \rbrace$ is a label
  • $m$ training examples: $\lbrace (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), …, (x^{(m)}, y^{(m)}) \rbrace$
    Use matrix $X$ to represent the input data with $m$ columns and $n_x$ rows:
    $X = \begin{bmatrix} \vert & \vert & & \vert \newline x^{(1)} & x^{(2)} & \cdots & x^{(m)} \newline \vert & \vert & & \vert \end{bmatrix} \in \mathbb{R}^{n_x \times m}$
    Similarly,
    $Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m}$
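For example, stacking a few hypothetical training examples as columns:

```python
import numpy as np

# Three hypothetical training examples with n_x = 2 features each
x1, x2, x3 = np.array([1., 2.]), np.array([3., 4.]), np.array([5., 6.])
y1, y2, y3 = 0, 1, 1

X = np.column_stack([x1, x2, x3])   # shape (n_x, m) = (2, 3)
Y = np.array([[y1, y2, y3]])        # shape (1, m) = (1, 3)
assert X.shape == (2, 3) and Y.shape == (1, 3)
```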

Logistic Regression

Given $x$, want $\hat{y} = p(y = 1 \vert x)$:

  • input: $x \in \mathbb{R}^{n_x}$
  • parameters: $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$
  • output: $\hat{y} = \sigma (w^T x + b)$, where $\sigma (z) = \frac 1{1 + e^{-z}}$

Notably, $\theta = \begin{bmatrix} \theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(n_x)} \end{bmatrix} ^T$ was used to represent the parameters in the Machine Learning course. However, to implement neural networks in deep learning, it is easier to use the parameters $w$ and $b$ separately, where $b$ represents $\theta_0$ and $w$ represents $\theta_1, \theta_2, …, \theta_{n_x}$.
For more about Logistic Regression, refer to Notes: Machine Learning by Andrew Ng (Week3).

Logistic Regression Cost Function

Given training set $\lbrace (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), …, (x^{(m)}, y^{(m)}) \rbrace$, want $\hat{y}^{(i)} \approx y^{(i)}$.
Loss (error) function:
$$\mathcal{L} (\hat{y}, y) = - \left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$$

The loss function is used to measure how good the output $\hat{y}$ is when the true label is $y$:

  • If $y = 1$:
    $\mathcal{L} (\hat{y}, y) = - \log \hat{y} \Rightarrow $ want $\ \log \hat{y} \ $ large $\Leftrightarrow$ want $\ \hat{y} \ $ large, $\ \therefore \hat{y} \rightarrow 1$
  • If $y = 0$:
    $\mathcal{L} (\hat{y}, y) = - \log (1 - \hat{y}) \Rightarrow $ want $\ \log (1 - \hat{y}) \ $ large $\Leftrightarrow$ want $\ \hat{y} \ $ small, $\ \therefore \hat{y} \rightarrow 0$

The squared error $\mathcal{L} (\hat{y}, y) = \frac 12 (\hat{y} - y)^2$ is not used in Logistic Regression, because it makes the optimization problem non-convex, ending up with multiple local optima.

Cost function:
$$\begin{align*} J(w, b) &= \frac 1m \sum_{i=1}^m \mathcal{L} (\hat{y}^{(i)}, y^{(i)}) \newline &= - \frac 1m \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right] \end{align*}$$
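The cost translates directly into NumPy; a vectorized sketch with illustrative values:

```python
import numpy as np

def cost(y_hat, y):
    """Cross-entropy cost J averaged over the m examples (columns)."""
    m = y.shape[1]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

y = np.array([[1, 0, 1]])
y_hat = np.array([[0.9, 0.1, 0.8]])
print(cost(y_hat, y))   # ≈ 0.1446; confident wrong answers are penalized heavily
```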

Gradient Descent

The process to find $w$, $b$ that minimize $J(w, b)$:
$Repeat \ \lbrace$
$\begin{align*} &\quad w = w - \alpha \frac {\partial J(w, b)}{\partial w} \newline &\quad b = b - \alpha \frac {\partial J(w, b)}{\partial b} \end{align*}$
$\rbrace$

Use variable “$dw$” to represent $\frac {\partial J(w, b)}{\partial w}$, and similarly, “$db$” to represent $\frac {\partial J(w, b)}{\partial b}$ in code.

Derivatives

Derivative formulas of elementary functions:

| | |
| --- | --- |
| $(C)' = 0$ | $(x^a)' = ax^{a-1} \ (a \in \mathbb{R}, x > 0)$ |
| $(\tan \ x)' = \sec^2 \ x$ | $(\cot \ x)' = -\csc^2 \ x$ |
| $(\sec \ x)' = \sec \ x \tan \ x$ | $(\csc \ x)' = -\csc \ x \cot \ x$ |
| $(a^x)' = a^x \ln a \ (a > 0, a \ne 1)$ | $(e^x)' = e^x$ |
| $(\log _a \vert x \vert)' = \frac 1{x \ln a} \ (a > 0, a \ne 1)$ | $(\ln \vert x \vert)' = \frac 1x$ |
| $(\arcsin \ x)' = \frac 1{\sqrt{1 - x^2}}$ | $(\arccos \ x)' = -\frac 1{\sqrt{1 - x^2}}$ |
| $(\arctan \ x)' = \frac 1{1 + x^2}$ | $(\mathrm{arccot} \ x)' = -\frac 1{1 + x^2}$ |

Derivatives with a Computation Graph

The computations of a neural network consist of a forward propagation step and a backward propagation step:

  • Forward propagation: to compute the output of the neural network.
  • Backward propagation: to compute gradients or derivatives.

Computation graph:
image

Use $“dvar”$ to represent $\frac {d \ FinalOutputVar}{d \ var}$ or $\frac {\partial \ FinalOutputVar}{\partial \ var}$. For example, $db$ = $\frac {dJ}{db}$.

The chain rule is used to compute derivatives. Take $db$ as an example:
$\begin{align}
&\because \frac {dJ}{db} = \frac {dJ}{du} \cdot \frac {du}{db}, \quad \frac {dJ}{du} = \frac {dJ}{dv} \cdot \frac {dv}{du} = 3 \newline &\therefore db = 3 \frac {du}{db} = 3 \cdot c = 6
\end{align}$
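The chain-rule result can be checked numerically; a sketch assuming the lecture's graph $J = 3v$, $v = a + u$, $u = bc$ with $a = 5$, $b = 3$, $c = 2$ (values read from the figure):

```python
def J(a, b, c):
    u = b * c          # first node
    v = a + u          # second node
    return 3 * v       # final output

# Analytic result from the chain rule: dJ/db = (dJ/du) * (du/db) = 3 * c = 6
eps = 1e-6
db = (J(5, 3 + eps, 2) - J(5, 3, 2)) / eps
assert abs(db - 6) < 1e-4
```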

Logistic Regression Gradient Descent

Logistic regression derivatives:
image

According to the derivative formulas:
$\begin{align}
&\color{red}{da = \frac {d \mathcal{L} (a, y)}{da} = -\frac ya + \frac {1-y}{1-a}} \newline
&\color{red}{dz = \frac {d \mathcal{L}}{dz} = \frac {d \mathcal{L}}{da} \cdot \frac {da}{dz}} \newline
&\color{red}{\because \frac {da}{dz} = \sigma '(z) = \frac {e^{-z}}{(1 + e^{-z}) ^2} = \frac 1{1 + e^{-z}} \cdot \frac {e^{-z}}{1 + e^{-z}} = a \cdot (1- a)} \newline
&\color{red}{\therefore dz = a - y} \newline
&\quad \color{red}{dw_1 = \frac {d \mathcal{L}}{d w_1} = \frac {d \mathcal{L}}{dz} \cdot \frac {dz}{dw_1} = x_1 \cdot dz} \newline
&\quad \color{red}{dw_2 = x_2 \cdot dz, \quad db = dz}
\end{align}$
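Both $\sigma'(z) = a(1-a)$ and $dz = a - y$ can be verified with a finite-difference check (the values of $z$ and $y$ are illustrative):

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
loss = lambda a, y: -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
a = sigmoid(z)

# sigma'(z) == a * (1 - a)
grad_sigma = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert abs(grad_sigma - a * (1 - a)) < 1e-8

# dL/dz == a - y
grad_z = (loss(sigmoid(z + eps), y) - loss(sigmoid(z - eps), y)) / (2 * eps)
assert abs(grad_z - (a - y)) < 1e-6
```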

Gradient Descent on m Examples

$$\frac {\partial}{\partial w_k} J(w, b) = \frac 1m \sum_{i=1}^m \frac {\partial}{\partial w_k} \mathcal{L} (a^{(i)}, y^{(i)}) = \frac 1m \sum_{i=1}^m dw_k^{(i)}$$

Algorithm:
$\begin{align}
&J = 0; \ dw_1 = 0; \ dw_2 = 0; \ …; \ dw_{n_x} = 0; \ db = 0 \newline
& \text{for} \ i = 1: m \ \lbrace \newline
& \quad z^{(i)} = w^T x^{(i)} + b \newline
& \quad a^{(i)} = \sigma (z^{(i)}) \newline
& \quad J \ \text{+=} \ - \left[ y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log (1 - a^{(i)}) \right] \newline
& \quad dz^{(i)} = a^{(i)} - y^{(i)} \newline
& \quad db \ \text{+=} \ dz^{(i)} \newline
& \quad \text{for} \ k = 1: n_x \ \lbrace \newline
& \quad \quad dw_k \ \text{+=} \ x_k^{(i)} \cdot dz^{(i)} \newline
& \quad \rbrace \newline
& \rbrace \newline
& \newline
& J \ \text{/=} \ m; \quad db \ \text{/=} \ m \newline
& \text{for} \ k = 1: n_x \ \lbrace \newline
& \quad dw_k \ \text{/=} \ m \newline
& \rbrace \newline
& \newline
& w_1 = w_1 - \alpha \cdot dw_1 \newline
& w_2 = w_2 - \alpha \cdot dw_2 \newline
& \vdots \newline
& w_{n_x} = w_{n_x} - \alpha \cdot dw_{n_x} \newline
& b = b - \alpha \cdot db \newline
\end{align}$

Python and Vectorization

Vectorizing Logistic Regression

  • Vectorize the for loop updating $dw_k$, where $\ dw_k \ \text{+=} \ x_k^{(i)} \cdot dz^{(i)}$:
    $\begin{align}
    &dw = np.zeros([n_x, 1]) \newline
    &dw \ \text{+=} x^{(i)} \ dz^{(i)} \newline
    &dw \ \text{/=} \ m
    \end{align}$
  • Vectorize the for loop computing $z^{(i)} = w^T x^{(i)} + b$, $a^{(i)} = \sigma (z^{(i)})$ through $m$ training examples:
    $Z = w^T X + b, \quad A = \sigma (Z)$
  • Vectorize the computations in backward propagation, where $\ dz^{(i)} = a^{(i)} - y^{(i)}$:
    $dZ = A - Y$

Algorithm with vectorization:
$\begin{align}
& Z = w^T X + b = np.dot(w.T, X) + b \newline
& A = \sigma (Z) \newline
& dZ = A - Y \newline
& dw = (1 / m) \ X dZ^T \newline
& db = (1 / m) \ np.sum(dZ) \newline
& \newline
& w = w - \alpha \cdot dw \newline
& b = b - \alpha \cdot db
\end{align}$
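Putting the vectorized steps together into a full training loop; the toy data, learning rate, and iteration count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable data: label 1 when the feature sum is positive
m, n_x = 200, 2
X = rng.standard_normal((n_x, m))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1

for _ in range(1000):
    Z = np.dot(w.T, X) + b           # forward propagation
    A = 1 / (1 + np.exp(-Z))
    dZ = A - Y                       # backward propagation
    dw = X @ dZ.T / m
    db = dZ.sum() / m
    w -= alpha * dw                  # gradient descent update
    b -= alpha * db

accuracy = ((A > 0.5) == Y).mean()   # training accuracy
assert accuracy > 0.9
```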

Broadcasting in Python

Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

For example: Calories from Carbs, Proteins, Fats in 100g of different foods:
image

How to calculate the portion of calories from Carb, Protein and Fat respectively?

import numpy as np
A = np.random.rand(3, 4) * 100  # placeholder values: rows are Carb/Protein/Fat, columns are foods
cal = A.sum(axis=0)             # total calories per food, shape (4,)
percentage = 100 * A / cal      # broadcasting: (3, 4) / (4,)

An example of broadcasting in practice:

>>> x = np.arange(4)
>>> xx = x.reshape(4,1)
>>> y = np.ones(5)
>>> z = np.ones((3,4))

>>> x.shape
(4,)

>>> y.shape
(5,)

>>> x + y
ValueError: operands could not be broadcast together with shapes (4,) (5,)

>>> xx.shape
(4, 1)

>>> y.shape
(5,)

>>> (xx + y).shape
(4, 5)

>>> xx + y
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.,  4.]])

>>> x.shape
(4,)

>>> z.shape
(3, 4)

>>> (x + z).shape
(3, 4)

>>> x + z
array([[ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.]])

A Note on Python/Numpy Vectors

Don’t use $np.random.randn(\color{blue} 5)$ to initialize a vector or a matrix in Python, because a variable initialized this way is a “rank 1 array”, which can lead to unexpected results.

>>> a = np.random.randn(5)
>>> a.shape
(5,)
>>> np.dot(a, a.T)
1.3594460613628967
>>>
>>> # Using the right shape as the parameter to initialize the vector or matrix does the right thing:
...
>>> b = np.random.randn(5,1)
>>> np.dot(b, b.T)
array([[ 0.67470791,  0.6236726 ,  0.90482273, -0.68818036,  0.96820675],
       [ 0.6236726 ,  0.57649763,  0.83638139, -0.63612598,  0.89497101],
       [ 0.90482273,  0.83638139,  1.21342014, -0.92289006,  1.29842182],
       [-0.68818036, -0.63612598, -0.92289006,  0.70192182, -0.98753973],
       [ 0.96820675,  0.89497101,  1.29842182, -0.98753973,  1.38937798]])
>>>
>>> assert(b.shape == (5, 1))

It is important to use the $reshape$ operation to make sure matrices and vectors have the dimensions we need.
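For example, a rank 1 array can be fixed with reshape, and an assert guards the expected shape:

```python
import numpy as np

a = np.random.randn(5)        # rank 1 array, shape (5,)
a = a.reshape(5, 1)           # now an explicit column vector
assert a.shape == (5, 1)
assert np.dot(a, a.T).shape == (5, 5)   # outer product, as expected
```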

Explanation of Logistic Regression Cost Function

$\begin{align}
& \hat{y} = p(y = 1 \vert x) \newline
& \text{If} \ \ y = 1: \quad p(y \vert x) = \hat{y} \newline
& \text{If} \ \ y = 0: \quad p(y \vert x) = 1 - \hat{y}
\end{align}$

Summarize the above-mentioned equations into one formula:
$$p(y \vert x) = \hat{y} ^y (1 - \hat{y})^{(1-y)}$$
$\begin{align}
\because \ & \text{If} \ \ y = 1, \quad p(y \vert x) = \hat{y}^1 \cdot (1 - \hat{y})^ 0 = \hat{y} \cdot 1 = \hat{y} \newline
& \text{If} \ \ y = 0, \quad p(y \vert x) = \hat{y}^0 \cdot (1 - \hat{y})^{(1-0)} = 1 \cdot (1 - \hat{y})^1 = 1 - \hat{y} \newline
\end{align}$

$\begin{align}
\therefore \ \color{red}{\log p(y \vert x)} &\color{red}{= \log \ \hat{y} ^y (1 - \hat{y})^{(1-y)}} \newline
& \color{red}{= y \log \hat{y} + (1- y) \log (1 - \hat{y})} \newline
& \color{red}{= - \mathcal{L} (\hat{y}, y)}
\end{align}$
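The cost function then follows from maximum likelihood estimation: assuming the $m$ training examples are i.i.d., maximizing the log-likelihood of the whole training set is equivalent to minimizing the sum of the losses, and scaling by $\frac 1m$ does not change the optimum:
$$\log \prod_{i=1}^m p(y^{(i)} \vert x^{(i)}) = \sum_{i=1}^m \log p(y^{(i)} \vert x^{(i)}) = - \sum_{i=1}^m \mathcal{L} (\hat{y}^{(i)}, y^{(i)})$$
$$\Rightarrow \ J(w, b) = \frac 1m \sum_{i=1}^m \mathcal{L} (\hat{y}^{(i)}, y^{(i)})$$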
